Too many results in dismax queries with one word

2011-08-20 Thread RaVbaker
Hi all,

I have a database of e-commerce products (5M) and am trying to build a search
solution for it.

I use stemmed, EdgeNGram, and DoubleMetaphone phonetic fields to tolerate
common typos in queries. This works quite well with the dismax QParser
for queries longer than one word: "tv lc20", "sny psp 3001", "cannon 5d",
etc. To avoid too many results I tuned the `mm` parameter. But when a user
types a single word like "ipad" or "cannon", I always get a lot of
results (~6). This is unacceptable for my client, who wants only the `good`
results, the ones that specifically match the query. That is hard for me to
accomplish because the DoubleMetaphone field converts words like "apt",
"opt", "ipad", and even "ipod" to the same phonetic code, APT. All of these
words then match about equally well, which gives me a huge number of
results. I have similar problems with words like "canon", "canine", and
"cannon", which all encode to KNN phonetically but have different meanings:
"canon" is a camera, "canine" is cat food, and "cannon" may be a misspelling
of canon or part of a book title about cannon weapons.

My first idea was to make a second requestHandler that does not search the
*_phonetic fields and use it for one-word queries. But that didn't work,
because sometimes I want to correct the user even if there is only one word
and suggest something better. The query "cannon" is a good example. I'm
fairly sure that most of the time when someone types "cannon" it is a typo
for "canon", and I want to show the user CANON cameras as well. That's why I
can't use a second requestHandler for one-word queries.

I'm looking for any ideas on how I could change my requestHandler.

My regular queries look like: http://localhost:8983/solr/select?q=cannon

Below are my requestHandler configuration and schema.xml.



solrconfig.xml (the XML markup was stripped by the archive; only the values
survive):

*:*
dismax

title^1.3 title_text^0.9 title_phonetic^0.74 title_ng^0.17
title_ngram^0.54
producer_name^0.9 producer_name_text^0.89
category_path_text^0.8 category_path_phonetic^0.65
description^0.60 description_text^0.56

title_text^1.1 title^1.2 description^0.3
3
0.1
2<100% 3<-1 5<85%

*,score
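
For reference, assuming the value `2<100% 3<-1 5<85%` above is this handler's
`mm` setting, here is a minimal Java sketch of how such a conditional spec
resolves to a required number of optional clauses. It follows the rules
documented for the dismax `mm` parameter (a condition `n<x` applies only when
there are more than n clauses, percentages are rounded down, negative values
mean "all but this many"). The class and method names are hypothetical, and
this is not Solr's own code.

```java
class MinShouldMatch {

    /** Resolves an mm spec such as "2<100% 3<-1 5<85%" for a clause count. */
    static int required(int optionalClauses, String spec) {
        int result = optionalClauses; // default: every clause is required
        for (String part : spec.trim().split("\\s+")) {
            int lt = part.indexOf('<');
            if (lt < 0) {
                // Unconditional spec such as "75%" or "-1".
                return simple(optionalClauses, part);
            }
            int threshold = Integer.parseInt(part.substring(0, lt));
            if (optionalClauses <= threshold) {
                // This condition and all later ones do not apply.
                return result;
            }
            result = simple(optionalClauses, part.substring(lt + 1));
        }
        return result;
    }

    /** Evaluates a simple expression: "3", "-1", "85%" or "-25%". */
    static int simple(int n, String expr) {
        int value;
        if (expr.endsWith("%")) {
            int percent = Integer.parseInt(expr.substring(0, expr.length() - 1));
            int share = n * percent / 100; // integer division truncates
            value = percent < 0 ? n + share : share;
        } else {
            int k = Integer.parseInt(expr);
            value = k < 0 ? n + k : k;
        }
        // Clamp to the valid range [0, n].
        return Math.max(0, Math.min(n, value));
    }
}
```

Under these rules a one-word query always requires its single clause, so
tuning `mm` alone cannot narrow the results of single-word queries.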




schema.xml (the XML markup was stripped by the archive; only these values
survive):

id
title

-- 
Rafał "RaVbaker" Piekarski.

web: http://ja.ravbaker.net
mail: ravba...@gmail.com
jid/xmpp/aim: ravba...@gmail.com
mobile: +48-663-808-481


Re: Update field value in the document based on value of another field in the document

2011-08-20 Thread bhawna singh
Now that I have set it up using UpdateProcessorChain, I am running into a null
exception.
Here is what I have.
In solrconfig.xml (the XML markup was stripped by the archive; only the chain
name survives):

mychain

Here is my Java code:
package mysolr;

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class AddConditionalFieldsFactory extends UpdateRequestProcessorFactory
{
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next)
  {
    System.out.println("From customization:");
    return new AddConditionalFields(next);
  }
}

class AddConditionalFields extends UpdateRequestProcessor
{
  public AddConditionalFields(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();

    Object v = doc.getFieldValue("url");
    if (v != null) {
      String url = v.toString();
      if (url.contains("question")) {
        doc.addField("tierFilter", "1");
      }
    }

    // pass it up the chain
    super.processAdd(cmd);
  }
}

I get the following error when I try to index:
Aug 20, 2011 10:48:43 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:74)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:53)


Any pointers, please? I am using Solr 3.3.

Thanks,
Bhawna

On Thu, Aug 18, 2011 at 2:04 PM, simon  wrote:

> An  UpdateRequestProcessor would do the trick. Look at the (rather minimal)
> documentation and code example in
> http://wiki.apache.org/solr/UpdateRequestProcessor
>
> -Simon
>
> On Thu, Aug 18, 2011 at 4:15 PM, bhawna singh 
> wrote:
>
> > Hi All,
> > I have a requirement to update a certain field value depending on the
> > field value of another field.
> > To elaborate-
> > I have a field called 'popularity' and a field called 'URL'. I need to
> > assign popularity value depending on the domain (URL) ( I have the
> > popularity and domain mapping in a text file).
> >
> > I am using CSVRequestHandler to import the data.
> >
> > What are the suggested ways to achieve this.
> > Your quick response is much appreciated.
> >
> > Thanks,
> > Bhawna
> >
>


Null exception in UpdateRequestProcessor

2011-08-20 Thread bhawna singh
Hi All,

I am trying to add a field in my Solr document based on the domain.
Now that I have set it up using UpdateProcessorChain, I am running into a null
exception.
Here is what I have.
In solrconfig.xml (the XML markup was stripped by the archive; only the chain
name survives):

mychain

Here is my Java code:
package mysolr;

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class AddConditionalFieldsFactory extends UpdateRequestProcessorFactory
{
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next)
  {
    System.out.println("From customization:");
    return new AddConditionalFields(next);
  }
}

class AddConditionalFields extends UpdateRequestProcessor
{
  public AddConditionalFields(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();

    Object v = doc.getFieldValue("url");
    if (v != null) {
      String url = v.toString();
      if (url.contains("question")) {
        doc.addField("tierFilter", "1");
      }
    }

    // pass it up the chain
    super.processAdd(cmd);
  }
}

I get the following error when I try to index:
Aug 20, 2011 10:48:43 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:74)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:53)

Any pointers, please? I am using Solr 3.3.

Thanks,
Bhawna


heads up: re-index trunk Lucene/Solr indices

2011-08-20 Thread Michael McCandless
Hi,

I just committed a new block tree terms dictionary implementation,
which requires fully re-indexing any trunk indices.

See here for details:

https://issues.apache.org/jira/browse/LUCENE-3030

If you are using a released version of Lucene/Solr then you can ignore
this message.

Mike McCandless

http://blog.mikemccandless.com


Re: query cache result

2011-08-20 Thread jame vaalet
Thanks, Tomás.
Can we set queryResultWindowSize for a particular query through the URL?
Say I want only the results of a particular set of queries to be cached, and
not those of other queries. Is it possible to control the query result cache
and the window size for each query separately?


2011/8/19 Tomás Fernández Löbbe 

> From my understanding, seeing the cache as a set of key-value pairs, this
> cache has the query as key and the list of IDs resulting from the query as
> values. When the exact same query is issued, it will be found as key in
> this cache, and Solr will already have the list of IDs that match it.
> If you set the size of this cache to 50, that means that Solr will keep in
> memory the last 50 queries with their list of resulting document IDs.
>
> The number of IDs per query can be configured with the parameter
> queryResultWindowSize
> http://wiki.apache.org/solr/SolrCaching#queryResultWindowSize
>
> On Fri, Aug 19, 2011 at 10:34 AM, jame vaalet 
> wrote:
>
> > wiki says *"size
> >
> > The maximum number of entries in the cache."
> > and queryResultCache
> >
> > This cache stores ordered sets of document IDs — the top N results of a
> > query ordered by some criteria.
> > *
> >
> > doesn't it mean number of document ids rather than number of queries ?
> >
> >
> >
> >
> >
> > 2011/8/19 Tomás Fernández Löbbe 
> >
> > > Hi Jame, the size for the queryResultCache is the number of queries that
> > > will fit into this cache. autowarmCount is the number of queries that are
> > > going to be copied from the old cache to the new cache when a commit
> > > occurs (actually, the queries are going to be executed again against the
> > > new IndexSearcher, as the results for them may have changed on the new
> > > index). initialSize is the initial size of the array; it will start to
> > > grow from that size up to "size". You may want to see this page of the
> > > wiki: http://wiki.apache.org/solr/SolrCaching
> > >
> > > Regards,
> > >
> > > Tomás
> > > On Fri, Aug 19, 2011 at 8:39 AM, jame vaalet 
> > wrote:
> > >
> > > > hi,
> > > > > I understand that the queryResultCache tag in solrconfig is the one
> > > > > which determines the cache size of Solr in the JVM.
> > > > >
> > > > > <queryResultCache ... size="${queryResultCacheSize:0}"
> > > > >     initialSize="${queryResultCacheInitialSize:0}"
> > > > >     autowarmCount="${queryResultCacheRows:0}" />
> > > > >
> > > > > Of the different attributes, what is size? Is it the amount of memory
> > > > > reserved in bytes? Or the number of doc IDs cached? Or is it the
> > > > > number of queries it will cache?
> > > > >
> > > > > Similarly, what are initialSize and autowarmCount measured in?
> > > > >
> > > > > Can someone please reply?
> > > >
> > >
> >
> >
> >
> > --
> >
> > -JAME
> >
>
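
Tomás's description of the queryResultCache as a map from query to document
IDs can be sketched in plain Java as follows. This is only an illustration of
"size counts queries, not document IDs", assuming least-recently-used
eviction; the class name and the cache keys are hypothetical, and none of
this is Solr's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Key: the query string. Value: the ordered document IDs it matched. */
class QueryResultCacheSketch extends LinkedHashMap<String, List<Integer>> {

    private final int maxEntries; // plays the role of the "size" attribute

    QueryResultCacheSketch(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most
        // recently used, so the eldest entry is the eviction candidate.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, List<Integer>> eldest) {
        // Evict once the number of cached queries exceeds the limit,
        // regardless of how many IDs each cached list holds.
        return size() > maxEntries;
    }
}
```

With `maxEntries` set to 2, caching a third distinct query evicts the least
recently used one, however long the ID lists of the remaining entries are.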



-- 

-JAME


Re: get update record from database using DIH

2011-08-20 Thread Alexandre Sompheng
Actually, I requested .../dataimport?command=delta-import&commit=true
and DIH in delta-import mode does not commit, as you can see in the log below.
My index is nearly empty, maybe 10 data rows at most. It's just the beginning.


INFO: Starting Delta Import
Aug 14, 2011 1:42:02 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-3.3.0 path=/dataimport params={commit=true&command=delta-import} status=0 QTime=0
Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.DocBuilder doDelta
INFO: Starting delta collection.
Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Running ModifiedRowKey() for Entity: event
Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity event with URL: jdbc:mysql://85.168.123.207:3306/AGENDA
Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 865
Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed ModifiedRowKey for Entity: event rows obtained : 3
Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed DeletedRowKey for Entity: event rows obtained : 0
Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed parentDeltaQuery for Entity: event
Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder doDelta
INFO: Delta Import completed successfully
Aug 14, 2011 1:42:03 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 0
Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder execute
INFO: Time taken = 0:0:1.282


On 19 Aug 2011, at 10:39, Gora Mohanty wrote:

> On Fri, Aug 19, 2011 at 5:32 AM, Alexandre Sompheng wrote:
> > Hi guys, I tried the delta import, and I got logs saying that it found
> > delta data to update. But it seems that the index is not updated. Any
> > guess why this happens? Did I miss something? I'm on Solr 3.3 with no
> > patch.
> [...]
>
> Please show us the following:
> * The exact URL you loaded for delta-import
> * The Solr response which shows the delta documents that it found,
>   and the status of the delta-import.
> If your index is large, and if you are running an optimise after the
> delta-import (the default is to optimise), it can take some time.
> Check the status: It will say "busy" if the optimise is still running.
>
> Regards,
> Gora


Re: Date Facet Question

2011-08-20 Thread Jamie Johnson
Makes complete sense. When faceting on dates, I just check whether NOW is in
the expression and replace it with either the beginning of the day or the end
of the day (depending on whether it's the lower or upper bound) and use that.
This works well.  Thanks for the quick response.
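
A minimal Java sketch of that substitution, as one possible reading of the
approach: it assumes UTC day boundaries and java.time, and the class and
method names are hypothetical. It only rewrites the date-math string; it does
not call any Solr API.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

class FacetDateBounds {

    /**
     * Replaces NOW in a date-math expression with a concrete UTC timestamp:
     * the start of the current day for a lower bound, the last millisecond
     * of the current day for an upper bound.
     */
    static String pinNow(String dateMath, boolean lowerBound, Instant now) {
        if (!dateMath.contains("NOW")) {
            return dateMath; // nothing to pin
        }
        Instant startOfDay = now.truncatedTo(ChronoUnit.DAYS);
        Instant pinned = lowerBound
                ? startOfDay
                : startOfDay.plus(1, ChronoUnit.DAYS).minusMillis(1);
        // Instant.toString() yields ISO-8601 in UTC with a trailing 'Z'.
        return dateMath.replace("NOW", pinned.toString());
    }
}
```

Because every request made on the same day then carries the same concrete
bounds, the labels returned for the ranges stay stable within that day.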

On Fri, Aug 19, 2011 at 3:13 PM, Chris Hostetter
 wrote:
>
> : when the response comes back the facet names are
> :
> : 2010-08-14T01:50:58.813Z
>        ...
> : instead of something like
> :
> : NOW-11MONTH
>        ...
> : where as facet queries if specifying a set of facet queries like
> :
> : datetime:[NOW-1YEAR TO NOW]
>        ...
> : the labels come back just as specified.  Is there a way to make date
> : range queries come back using the query specified and not the parsed
> : date?
>
> No.  If dates were the only factor here we could maybe add an option for
> that, but the faceting code is all generalized now to support all numerics,
> so it wouldn't really make sense in general.
>
> It's also not clear how an option like this would work if/when stuff like
> SOLR-2366 gets implemented - returning the concrete value used as the
> lower bound of the range is unambiguous.
>
> The functional difference between facet.range and facet.query is pretty
> significant, so it's kind of an apples/oranges thing to compare their
> output -- with facet.query you can specify any arbitrary query
> expression your heart desires, and that literal unparsed query string
> is again used as the constraint key in the resulting NamedList because
> it's as unambiguous as we can be given the circumstances.
>
>
>
>
> -Hoss
>


Re: Why are not query keywords treated as a set?

2011-08-20 Thread Gabriele Kahlout
Part of the query is 'injected' by my application, which is unaware of the
user query. If I knew that 'paste past' would end up together as the query
'past past', I would not inject anything, since it distorts the score
calculation. I could inject after it, but it is not easy.


So, trying to solve it right in the RequestHandler, I have difficulties with
queries that contain phrases ("") or the 'must be present' + operator. For
example, I would not want to touch a user query such as
+"zusammen essen" +"alein essen", where 'essen' is the duplicate term.

My 'good enough' solution is thus to not remove the duplicate in clauses
prefixed by + or ".

C := set of clauses in which the duplicated term t occurs.
for each clause c in C:
do
  if (!c.toString().startsWith("\"") &&
      !c.toString().startsWith("+") &&
      |C| > 1) {
    C.remove(c);
  }
end
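
The pseudocode above could look roughly like this in Java. It is a sketch
under stated assumptions, not Lucene or Solr code: clauses are plain strings
that have already been split, a clause not prefixed by + or " is a single
term, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class QueryTermDeduper {

    /** Terms of a clause, ignoring a leading '+' and any quotes. */
    static List<String> termsOf(String clause) {
        String bare = clause.replaceAll("^\\+", "").replace("\"", "").trim();
        return Arrays.asList(bare.split("\\s+"));
    }

    /** Clauses starting with + or " are never removed. */
    static boolean isProtected(String clause) {
        return clause.startsWith("+") || clause.startsWith("\"");
    }

    /** Drops unprotected single-term clauses whose term occurs elsewhere. */
    static List<String> dedupe(List<String> clauses) {
        // For each term, count the clauses that still contain it (|C|).
        Map<String, Integer> remaining = new HashMap<>();
        for (String clause : clauses) {
            for (String term : new HashSet<>(termsOf(clause))) {
                remaining.merge(term, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (String clause : clauses) {
            List<String> terms = termsOf(clause);
            boolean droppable = !isProtected(clause)
                    && terms.size() == 1
                    && remaining.get(terms.get(0)) > 1;
            if (droppable) {
                remaining.merge(terms.get(0), -1, Integer::sum);
            } else {
                kept.add(clause);
            }
        }
        return kept;
    }
}
```

Counting the remaining clauses per term, instead of removing from C while
iterating over it, keeps at least one clause for every term.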

What do you think? Are there better solutions or algorithms to make sure the
same term occurs only once in a query, or at least is weighted only once in
the score calculation?


On Mon, Jun 20, 2011 at 11:15 AM, Markus Jelsma wrote:

> That only removes tokens at the same position, as the wiki explains.
>
> Gabriele, why would you expect that? You input two tokens, so you query for
> two tokens; why would it be a `set`?
>
> > this might help in your analysis chain
> >
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory
> >
> > On 20 June 2011 04:21, Gabriele Kahlout 
> wrote:
> > > past past
> > > *past past*
> > > *content:past content:past*
> > >
> > > I was expecting the query to get parsed into content:past only and not
> > > content:past content:past.
> > >
> > > On Mon, Jun 20, 2011 at 12:12 AM, lee carroll
> > >
> > > wrote:
> > >> do you mean a phrase query? "past past"
> > >> can you give some more detail?
> > >>
> > >> On 18 June 2011 13:02, Gabriele Kahlout 
> wrote:
> > >> > q=past past
> > >> >
> > >> > 1.0 = (MATCH) sum of:
> > >> > *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
> > >> >   1.0 = tf(termFreq(content:past)=1)
> > >> >   1.0 = idf(docFreq=1, maxDocs=2)
> > >> >   0.5 = fieldNorm(field=content, doc=0)
> > >> > *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
> > >> >   1.0 = tf(termFreq(content:past)=1)
> > >> >   1.0 = idf(docFreq=1, maxDocs=2)
> > >> >   0.5 = fieldNorm(field=content, doc=0)
> > >> >
> > >> > Is there a way I can treat the query keywords as a set?
> > >> >
> > >> > --
> > >> > Regards,
> > >> > K. Gabriele
> > >> >
> > >> > --- unchanged since 20/9/10 ---
> > >> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
> > >> > receipt within 48 hours then I don't resend the email.
> > >> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> > >> > time(x) < Now + 48h) ⇒ ¬resend(I, this).
> > >> >
> > >> > If an email is sent by a sender that is not a trusted contact or the
> > >> > email does not contain a valid code then the email is not received. A
> > >> > valid code starts with a hyphen and ends with "X".
> > >> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y
> > >> > ∈ L(-[a-z]+[0-9]X)).
> > >
> > > --
> > > Regards,
> > > K. Gabriele
> > >
>



-- 
Regards,
K. Gabriele
