Re: suggester issues

2011-08-21 Thread Kuba Krzemien
Finally got it working - turns out you can't just add it to the lib dir as 
the wiki suggests. Unfortunately the only way is adding it to solr.war.


Thanks for your help.

--
From: "William Oberman" 
Sent: Friday, August 19, 2011 5:07 PM
To: 
Subject: Re: suggester issues


Hard to say, so I'll list the exact steps I took:
-Downloaded apache-solr-3.3.0 (I like to stick with releases vs. svn)
-Untar and cd
-ant
-Wrote my class below (under a peer directory in apache-solr-3.3.0)
-javac -cp ../dist/apache-solr-core-3.3.0.jar:../lucene/build/lucene-core-3.3-SNAPSHOT.jar com/civicscience/SpellingQueryConverter.java

-jar cf cs.jar com
-Unzipped solr.war (under example)
-Added my cs.jar to the lib dir (under WEB-INF)
-Rezipped solr.war
-Added: <queryConverter name="queryConverter" class="com.civicscience.SpellingQueryConverter"/> to solrconfig.xml

-Restarted jetty

And, that seemed to all work.
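For reference, the solrconfig.xml side looks roughly like the sketch below
(untested as written here; the suggest component shown is just the stock
wiki-style setup, and the lookupImpl/field values are placeholders - adjust to
your schema):

<!-- register the custom converter at the top level of solrconfig.xml -->
<queryConverter name="queryConverter" class="com.civicscience.SpellingQueryConverter"/>

<!-- a typical Suggester component it plugs into -->
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>
  </lst>
</searchComponent>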

will

On Aug 19, 2011, at 10:44 AM, Kuba Krzemien wrote:

As far as I can tell, creating a custom query converter is the only way to
make this work.
Unfortunately I have some problems running it - after creating a JAR
with my class (I'm using your source code, apart from the package and
class names) and dropping it into the lib dir, I've added
<queryConverter name="queryConverter" class="mypackage.MySpellingQueryConverter"/> to
solrconfig.xml.


I get a "SEVERE: org.apache.solr.common.SolrException: Error 
Instantiating QueryConverter, mypackage.MySpellingQueryConverter is not a 
org.apache.solr.spelling.QueryConverter".


What am I doing wrong?

--
From: "William Oberman" 
Sent: Thursday, August 18, 2011 10:35 PM
To: 
Subject: Re: suggester issues


I tried this:
package com.civicscience;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

/**
* Converts the query string to a Collection of Lucene tokens.
**/
public class SpellingQueryConverter extends QueryConverter  {

/**
 * Converts the original query string to a collection of Lucene Tokens.
 * @param original the original query string
 * @return a Collection of Lucene Tokens
 */
@Override
public Collection<Token> convert(String original) {
  if (original == null) {
    return Collections.emptyList();
  }
  Collection<Token> result = new ArrayList<Token>();
  Token token = new Token(original, 0, original.length(), "word");
  result.add(token);
  return result;
}

}

And added it to the classpath, and now it does what I expect.
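A quick standalone sanity check (just an illustration, not something Solr
itself runs) is to feed a phrase through the converter and make sure it comes
back as a single token rather than one token per word:

import java.util.Collection;

import org.apache.lucene.analysis.Token;

import com.civicscience.SpellingQueryConverter;

public class ConverterCheck {
  public static void main(String[] args) {
    SpellingQueryConverter conv = new SpellingQueryConverter();
    // the whole phrase should come back as exactly one token
    Collection<Token> tokens = conv.convert("internet explorer");
    System.out.println(tokens.size() + " token(s)"); // expect: 1 token(s)
  }
}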

will


On Aug 18, 2011, at 2:33 PM, Alexei Martchenko wrote:

It can be done, I did that with shingles, but it's not the way it's meant to
be. The main problem with the suggester is that we want compound words and we
never get them. I try to get "internet explorer", but when I enter the second
word ("internet e") the suggester never finds "explorer".

2011/8/18 oberman_cs 


I was trying to deal with the exact same issue, with the exact same results.
Is there really no way to feed a phrase into the suggester (spellchecker)
without it splitting the input phrase into words?

--
View this message in context:
http://lucene.472066.n3.nabble.com/suggester-issues-tp3262718p3265803.html
Sent from the Solr - User mailing list archive at Nabble.com.





--

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533




Re: suggester issues

2011-08-21 Thread Will Oberman



Sent from my iPhone


Re: Terms.regex performance issue

2011-08-21 Thread Erick Erickson
Wait. Sometimes I get confused because gmail will substitute
* for bolding, so in my client it looks like you're searching infix (e.g.
leading and trailing wildcards). If that's the case, then your performance
will always be poor, since it has to enumerate all the terms in the field...
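To put it concretely (assuming the example /terms handler and a made-up field
name):

/solr/terms?terms.fl=title&terms.prefix=quer       (can seek straight to the matching terms)
/solr/terms?terms.fl=title&terms.regex=.*quer.*    (has to walk every term in the field)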

If it's just bolding confusing me, then never mind

Best
Erick

On Fri, Aug 19, 2011 at 8:27 PM, O. Klein  wrote:
> Terms.prefix was just to compare performance.
>
> The use case was terms.regex=.*query.* And as Markus pointed out, this will
> probably remain a bottleneck.
>
> I looked at the Suggester. But like many others I have been struggling to
> make it useful. It needs a custom queryConverter to give proper suggestions,
> but I haven't tried this yet.
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3269628.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Too many results in dismax queries with one word

2011-08-21 Thread Erick Erickson
The root problem here is "This is unacceptable for my client". The first
thing I'd suggest is that you work with your client and get them to define
what is acceptable. You'll be forever changing things (to no good purpose)
if all they can say is "that's not right".

For instance, you apparently have two competing requirements:
1> try to correct users' input, which inevitably increases the results returned
2> narrow the search to the "right" results.

You can't have both every time!

So you could try something like going with a more-restrictive search (no
metaphone comparison) first and, if the results returned weren't sufficient,
firing the "broader" query back, without showing the too-small result set
to the user first.
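For illustration only (field names lifted from your schema below, boosts made
up), that two-pass idea can be as simple as overriding qf per request:

First, a strict pass with no phonetic fields:
http://localhost:8983/solr/select?defType=dismax&q=cannon&qf=title_text^1.3+producer_name_text^0.9+category_path_text^0.8

Then, only if numFound is too small, re-issue the broad query:
http://localhost:8983/solr/select?defType=dismax&q=cannon&qf=title_text^1.3+title_phonetic^0.74+producer_name_text^0.9+category_path_phonetic^0.65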

You could work with your client and see if what they really want is
just the most relevant
results at the top of the list, in which case you can play with the
dismax field boosts
(by the way, what version of Solr are you using?)

You could work with the client to understand the user experience if
you use autocomplete
and/or faceting etc. to guide their explorations.

You could...

But none of that will help unless and until you and your client can
agree what is the
correct behavior ahead of time

Best
Erick

On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker)
 wrote:
> Hi all,
>
> I have a database of e-commerce products (5M) and am trying to build a search
> solution for it.
>
> I have used stemmer, edgengram and doublemetaphone phonetic fields for
> omitting common typos in queries. It works quite well with the dismax QParser
> for queries longer than one word: "tv lc20", "sny psp 3001", "cannon 5d"
> etc. To avoid having too many results I manipulated the `mm` parameter. But
> when a user types a single word like "ipad" or "cannon", I always get a lot of
> results (~6). This is unacceptable for my client. He would like to have
> only the `good` results - the ones that particularly match the specific query. It's
> hard to accomplish because of the doublemetaphone field, which converts
> words like "apt", "opt" and "ipad" and even "ipod" to the same phonetic word
> - APT. And then all of these words are matched fairly the same, which gives me
> a huge number of results. I have similar problems with other words like
> "canon", "canine" and "cannon", which are KNN in a phonetic sense but lexically
> have different meanings: "canon" - camera, "canine" - cat food, "cannon" -
> may be a misspelling of canon, or part of a book title about cannon weapons.
>
> My first idea was to make a second requestHandler that does not search the
> *_phonetic fields, and use it for queries with only one word. But it didn't
> work, because sometimes I want to correct the user even when there is only one word
> and suggest something better. The query "cannon" is a good example: I'm
> fairly sure that most of the time when someone types "cannon" it is a
> typo for "canon", and I want to show the user CANON cameras as well. That's why I
> can't use a second requestHandler for one-word queries.
>
> I'm looking for any ideas on how I could change my requestHandler.
>
> My regular queries are: http://localhost:8983/solr/select?q=cannon
>
> Below I put my configuration for requestHandler and schema.xml.
>
>
>
> solrconfig.xml:
>
> 
>   
> *:*
>     dismax
>     
>         title^1.3 title_text^0.9 title_phonetic^0.74 title_ng^0.17
>         title_ngram^0.54
>         producer_name^0.9 producer_name_text^0.89
>         category_path_text^0.8 category_path_phonetic^0.65
>         description^0.60 description_text^0.56
>     
>     title_text^1.1 title^1.2 description^0.3
>     3
>     0.1
>     2<100% 3<-1 5<85%
>
>     *,score
> 
> 
>
>
> schema.xml:
>
> 
> 
>    
>         omitNorms="true" positionIncrementGap="0" />
>     omitNorms="true" positionIncrementGap="0"/>
>         sortMissingLast="true" omitNorms="true" />
>         sortMissingLast="true" omitNorms="true" />
>         precisionStep="2" omitNorms="true" positionIncrementGap="0" />
>
>         positionIncrementGap="100">
>            
>                
>                
>        
>                        ignoreCase="true"
>                                words="stopwords_pl.txt"
>                enablePositionIncrements="true"
>                />
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
>                
>                
> 
>            
>        
>
>     positionIncrementGap="100">
>            
>                
>                
>                        ignoreCase="true"
>                words="stopwords_pl.txt"
>                enablePositionIncrements="true"
>                />
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
>                
>                
>            
>        
>
>
>     class="solr.TextField" >
>      
>        
>                        ignoreCase="true"
>                words="stopwords_pl.txt"
> 

Re: Update field value in the document based on value of another field in the document

2011-08-21 Thread Erick Erickson
Publishing stack traces does no good unless you also tell us what version
of Solr you are using. The source-file numbers do move around between
versions

Also, what line in your code is at the root of this chain? The very first thing
I'd do is just comment out your custom code (i.e. just have the
super.processAdd call) and build up from there. Some printlns might show the
problem by testing, for instance, that doc is not null (I can't imagine why it
would be, but it's been said that "It's not the things you don't know that'll
kill you, it's the things you do know that aren't true").
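In other words, something like this as a stripped-down starting point (just a
sketch, mirroring the processor code you posted below):

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.getSolrInputDocument();
  // temporary diagnostics -- confirm the processor is reached and doc is sane
  System.out.println("processAdd called, doc null? " + (doc == null));
  // custom logic commented out; add it back a piece at a time
  super.processAdd(cmd);
}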

Best
Erick

On Sat, Aug 20, 2011 at 2:39 PM, bhawna singh  wrote:
> Now that I have set it up using UpdateProcessorChain, I am running into a null
> exception.
> Here is what I have-
> In SolrConfig.xml
> 
>    
>  
>   
>   
>  
>
>
>   startup="lazy" >
>        
>            mychain
>        
> 
>
>
> Here is my java code-
> package mysolr;
>
>
> import java.io.IOException;
>
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.solr.request.SolrQueryRequest;
> import org.apache.solr.request.SolrQueryResponse;
> import org.apache.solr.update.AddUpdateCommand;
> import org.apache.solr.update.processor.UpdateRequestProcessor;
> import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
>
> public class AddConditionalFieldsFactory extends
> UpdateRequestProcessorFactory
> {
>  @Override
>  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
> SolrQueryResponse rsp, UpdateRequestProcessor next)
>  {
>      System.out.println("From customization:");
>    return new AddConditionalFields(next);
>  }
> }
>
> class AddConditionalFields extends UpdateRequestProcessor
> {
>  public AddConditionalFields( UpdateRequestProcessor next) {
>
>    super( next );
>  }
>
>  @Override
>  public void processAdd(AddUpdateCommand cmd) throws IOException {
>    SolrInputDocument doc = cmd.getSolrInputDocument();
>
>    Object v = doc.getFieldValue( "url" );
>    if( v != null ) {
>      String url =  v.toString();
>      if( url.contains("question") ) {
>        doc.addField( "tierFilter", "1" );
>      }
>    }
>
>    // pass it up the chain
>    super.processAdd(cmd);
>  }
> }
>
> I get the following error when I try to index:
> Aug 20, 2011 10:48:43 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.AbstractMethodError  at
> org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:74)
>        at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:53)
>
>
> Any pointers please. I am using Solr 3.3
>
> Thanks,
> Bhawna
>
> On Thu, Aug 18, 2011 at 2:04 PM, simon  wrote:
>
>> An  UpdateRequestProcessor would do the trick. Look at the (rather minimal)
>> documentation and code example in
>> http://wiki.apache.org/solr/UpdateRequestProcessor
>>
>> -Simon
>>
>> On Thu, Aug 18, 2011 at 4:15 PM, bhawna singh 
>> wrote:
>>
>> > Hi All,
>> > I have a requirement to update a certain field value depending on the
>> field
>> > value of another field.
>> > To elaborate-
>> > I have a field called 'popularity' and a field called 'URL'. I need to
>> > assign popularity value depending on the domain (URL) ( I have the
>> > popularity and domain mapping in a text file).
>> >
>> > I am using CSVRequestHandler to import the data.
>> >
>> > What are the suggested ways to achieve this.
>> > Your quick response is much appreciated.
>> >
>> > Thanks,
>> > Bhawna
>> >
>>
>


Re: get update record from database using DIH

2011-08-21 Thread Erick Erickson
At a guess, you're not getting as many rows as you think, and commit has
nothing to do with it. But that's just a guess.

So the very first thing I'd do is be sure the SQL is doing what you
think. There's
a little-known data import debugging page that might help:

blahblahblah/solr/admin/dataimport.jsp
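The status command is also worth checking (path depends on your webapp and
handler name; with the stock example setup it would be):

http://localhost:8983/solr/dataimport?command=status

It tells you whether the handler is idle or still busy, and how many rows were
fetched and documents processed by the last run.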

Best
Erick

On Sat, Aug 20, 2011 at 4:30 PM, Alexandre Sompheng  wrote:
> Actually I requested      .../dataimport?command=delta-import&commit=true
> And DIH in delta-import mode does not commit, you can se log below. My index
> is quite empty, maybe 10 data rows max... It's just the beginning.
>
>
> INFO: Starting Delta Import
>
> Aug 14, 2011 1:42:02 AM org.apache.solr.core.SolrCore execute
>
> INFO: [] webapp=/apache-solr-3.3.0 path=/dataimport
> params={commit=true&command=delta-import} status=0 QTime=0
>
> Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
>
> INFO: Read dataimport.properties
>
> Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.DocBuilder
> doDelta
>
> INFO: Starting delta collection.
>
> Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
>
> INFO: Running ModifiedRowKey() for Entity: event
>
> Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.JdbcDataSource$1
> call
>
> INFO: Creating a connection for entity event with URL: jdbc:mysql://
> 85.168.123.207:3306/AGENDA
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.JdbcDataSource$1
> call
>
> INFO: Time taken for getConnection(): 865
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
>
> INFO: Completed ModifiedRowKey for Entity: event rows obtained : 3
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
>
> INFO: Completed DeletedRowKey for Entity: event rows obtained : 0
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
>
> INFO: Completed parentDeltaQuery for Entity: event
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder
> doDelta
>
> INFO: Delta Import completed successfully
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.update.processor.LogUpdateProcessor
> finish
>
> INFO: {} 0 0
>
> Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder
> execute
>
> INFO: Time taken = 0:0:1.282
>
>
> On 19 août 2011, at 10:39, Gora Mohanty  wrote:
>
> On Fri, Aug 19, 2011 at 5:32 AM, Alexandre Sompheng 
> wrote:
>
> Hi guys, I tried the delta import, and I got logs saying that it found delta
> data to update. But it seems that the index is not updated. Any guess
> why this happens? Did I miss something? I'm on Solr 3.3 with no
> patch.
>
> [...]
>
> Please show us the following:
> * The exact URL you loaded for delta-import
> * The Solr response which shows the delta documents that it found,
>  and the status of the delta-import.
> If your index is large, and if you are running an optimise after the
> delta-import (the default is to optimise), it can take some time.
> Check the status: It will say "busy" if the optimise is still running.
>
> Regards,
> Gora
>


Re: Too many results in dismax queries with one word

2011-08-21 Thread RaVbaker
Thanks for the reply. I know that sometimes meeting all of the client's needs will be
impossible, but then the client points out that a competing (commercial) product
already does it (though it has other problems, like performance). And then I'm
obligated to try more tricks. :/

I'm currently using Solr 3.1 but am thinking about migrating to the latest stable
version - 3.3.

You're correct - to meet the client's needs I have also used some hacks with boosting
queries (the `bq` and `bf` parameters), but I omitted those to keep the XML clearer.

You mentioned faceting. This is also one of my (my client's?) problems. In the
user interface they want to show 5 categories for products. Those 5 should
be the most relevant ones. When I take the ones with the highest counts for one-word
queries, most of the time they are "not the ones that should be there". For
example, the phrase "ipad" actually has only 12 relevant products
in the category "tablets", but the phonetic APT also matches part of the model name for
hundreds of UPS power supplies and bathtubs. And those end up on the list,
not tablets. :/

But you mentioned autocomplete, which is something I haven't looked at yet. I'll
try that and show it to my client.

-- 
Rafał "RaVbaker" Piekarski.

web: http://ja.ravbaker.net
mail: ravba...@gmail.com
jid/xmpp/aim: ravba...@gmail.com
mobile: +48-663-808-481


On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson wrote:

> The root problem here is "This is unacceptable for my client". The first
> thing I'd suggest is that you work with your client and get them to define
> what is acceptable. You'll be forever changing things (to no good purpose)
> if all they can say is "that's not right".
>
> For instance, you apparently have two competing requirements:
> 1> try to correct users input, which inevitably increases the results
> returned
> 2> narrow the search to the "right" results.
>
> You can't have both every time!
>
> So you could try something like going with a more-restrictive search
> (no metaphone
> comparison) first and, if the results returned weren't sufficient
> firing the "broader" query
> back, without showing the too-small results first.
>
> You could work with your client and see if what they really want is
> just the most relevant
> results at the top of the list, in which case you can play with the
> dismax field boosts
> (by the way, what version of Solr are you using?)
>
> You could work with the client to understand the user experience if
> you use autocomplete
> and/or faceting etc. to guide their explorations.
>
> You could...
>
> But none of that will help unless and until you and your client can
> agree what is the
> correct behavior ahead of time
>
> Best
> Erick
>
> On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker)
>  wrote:
> > Hi all,
> >
> > I have a database of e-commerce products (5M) and trying to build a
> search
> > solution for it.
> >
> > I have used steemer, edgengram and doublemetaphone phonetic fields for
> > omiting common typos in queries.  It works quite good with dismax QParser
> > for queries longer than one word: "tv lc20", "sny psp 3001", "cannon 5d"
> > etc. For not having too many results I manipulated with `mm` parameter.
> But
> > when user type a single word like "ipad", "cannon". I always having a lot
> of
> > results (~6). This is unacceptable for my client. He would like to
> have
> > then only the `good` results. That particulary match specific query. It's
> > hard to acomplish for me cause of use doublemetaphone field which
> converts
> > words like "apt", "opt" and "ipad" and even "ipod" to the same phonetic
> word
> > - APT. And then all of these  words are matched fairly the same gives me
> > huge amount of results. Similar problems I have with other words like
> > "canon", "canine" and "cannon" which are KNN in phonetic way. But
> lexically
> > have different meanings: "canon" - camera, "canine" - cat food , "cannon"
> -
> > may be a misspell for canon or part of book title about cannon weapons.
> >
> > My first idea was to make a second requestHandler without searching in
> > *_phonetic fields. And use it for queries with only one word. But it
> didn't
> > worked cause sometimes I want to correct user even if there is only one
> word
> > and suggest him something better. Query "cannon" is a good example. I'm
> > fairly sure that most of the time when someone type "cannon" it would be
> a
> > typo for "canon" and I want to show user also CANON cameras. That's why I
> > can't use second requestHandler for one word queries.
> >
> > I'm looking for any ideas how could I change my requestHandler.
> >
> > My regular queries are: http://localhost:8983/solr/select?q=cannon
> >
> > Below I put my configuration for requestHandler and schema.xml.
> >
> >
> >
> > solrconfig.xml:
> >
> > 
> >   
> > *:*
> > dismax
> > 
> > title^1.3 title_text^0.9 title_phonetic^0.74 title_ng^0.17
> > title_ngram^0.54
> > producer_name^0.9 producer_name_text^0.89
> > category_path_text^0.8 category_path

Re: Requiring multiple matches of a term

2011-08-21 Thread Simon Willnauer
On Fri, Aug 19, 2011 at 6:26 PM, Michael Ryan  wrote:
> Is there a way to specify in a query that a term must match at least X times 
> in a document, where X is some value greater than 1?
>

One simple way of doing this might be to write a wrapper for TermQuery
that only returns docs with a term frequency > X. As far as I
understand the question, those terms don't have to be within a certain
window, right?
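A rough sketch of the idea (Lucene 3.x API, untested) - done here as a Filter
over TermDocs.freq() rather than an actual TermQuery subclass, just to keep it
short:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

/** Keeps only documents in which `term` occurs at least minFreq times. */
public class MinTermFreqFilter extends Filter {
  private final Term term;
  private final int minFreq;

  public MinTermFreqFilter(Term term, int minFreq) {
    this.term = term;
    this.minFreq = minFreq;
  }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs td = reader.termDocs(term);
    try {
      while (td.next()) {
        if (td.freq() >= minFreq) {   // within-document term frequency
          bits.set(td.doc());
        }
      }
    } finally {
      td.close();
    }
    return bits;
  }
}

Wrapping that in a FilteredQuery (or passing it as the filter alongside the
TermQuery itself) gives you "dog at least 3 times" with no positional/window
constraint.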

simon
> For example, I want to only get documents that contain the word "dog" three 
> times.  I've thought that using a proximity query with an arbitrary large 
> distance value might do it:
> "dog dog dog"~10
> And that does seem to return the results I expect.
>
> But when I try for more than three, I start getting unexpected result counts 
> as I change the proximity value:
> "dog dog dog dog"~10 returns 6403 results
> "dog dog dog dog"~20 returns 9291 results
> "dog dog dog dog"~30 returns 6395 results
>
> Anyone ever do something like this and know how I can accomplish this?
>
> -Michael
>


Re: Too many results in dismax queries with one word

2011-08-21 Thread Sujit Pal
Would it make sense to have a "Did you mean?" type of functionality for
which you use the EdgeNGram and Metaphone filters /if/ you don't get
appropriate results for the user query?

So when the user types "cannon" and the application notices that there are
no cannons for sale in the index (0 results with standard analysis), it
then makes another query with the EdgeNGram and/or Metaphone filters and
comes back with:

Did you mean "Canon", "Canine"?

Clicking on "Canon" or "Canine" would fire off a query for these terms.

That way your application doesn't guess what is right, it goes back and
asks the user what he wants.
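In application code (SolrJ) the two-pass logic might look roughly like this -
the handler names are made up, it's just to show the shape:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DidYouMean {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery strict = new SolrQuery("cannon");
    strict.set("qt", "strict");      // hypothetical handler: no phonetic/ngram fields
    QueryResponse res = solr.query(strict);

    if (res.getResults().getNumFound() == 0) {
      SolrQuery fuzzy = new SolrQuery("cannon");
      fuzzy.set("qt", "phonetic");   // hypothetical handler: metaphone + edge-ngram fields
      QueryResponse alt = solr.query(fuzzy);
      // show alt.getResults() as "Did you mean ...?" choices,
      // rather than presenting them as real hits
    }
  }
}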

-sujit

On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote:
> Thanks for reply. I know that sometimes meeting all clients needs would be
> impossible but then client recalls that competitive (commercial) product
> already do that (but has other problems, like performance). And then I'm
> obligated to try more tricks. :/
> 
> I'm currently using Solr 3.1 but thinking about migrating to latest stable
> version - 3.3.
> 
> You correct, to meet client needs I have also used some hacks with boosting
> queries (`bq` and `bf` parameters) but I omit that to make XMLs clearer.
> 
> You mentioned faceting. This is also one of my(my client?) problems. In the
> user interface they want to have 5 categories for products. Those 5 should
> be most relevance ones. When I get those with highest counts for one word
> queries they are most of the time "not that which should be there". For
> example with phrase "ipad" which actually has only 12 most relevant products
> in category "tablets" but phonetic APT matches also part of model name for
> hundreds of UPS power supplies and bath tubes . And these are on the list,
> not tablets. :/
> 
> But you mentioned autocomplete which is something what I haven't watched
> yet. I'll try with that and show it to my client.
> 
> -- 
> Rafał "RaVbaker" Piekarski.
> 
> web: http://ja.ravbaker.net
> mail: ravba...@gmail.com
> jid/xmpp/aim: ravba...@gmail.com
> mobile: +48-663-808-481
> 
> 
> On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson 
> wrote:
> 
> > The root problem here is "This is unacceptable for my client". The first
> > thing I'd suggest is that you work with your client and get them to define
> > what is acceptable. You'll be forever changing things (to no good purpose)
> > if all they can say is "that's not right".
> >
> > For instance, you apparently have two competing requirements:
> > 1> try to correct users input, which inevitably increases the results
> > returned
> > 2> narrow the search to the "right" results.
> >
> > You can't have both every time!
> >
> > So you could try something like going with a more-restrictive search
> > (no metaphone
> > comparison) first and, if the results returned weren't sufficient
> > firing the "broader" query
> > back, without showing the too-small results first.
> >
> > You could work with your client and see if what they really want is
> > just the most relevant
> > results at the top of the list, in which case you can play with the
> > dismax field boosts
> > (by the way, what version of Solr are you using?)
> >
> > You could work with the client to understand the user experience if
> > you use autocomplete
> > and/or faceting etc. to guide their explorations.
> >
> > You could...
> >
> > But none of that will help unless and until you and your client can
> > agree what is the
> > correct behavior ahead of time
> >
> > Best
> > Erick
> >
> > On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker)
> >  wrote:
> > > Hi all,
> > >
> > > I have a database of e-commerce products (5M) and trying to build a
> > search
> > > solution for it.
> > >
> > > I have used steemer, edgengram and doublemetaphone phonetic fields for
> > > omiting common typos in queries.  It works quite good with dismax QParser
> > > for queries longer than one word: "tv lc20", "sny psp 3001", "cannon 5d"
> > > etc. For not having too many results I manipulated with `mm` parameter.
> > But
> > > when user type a single word like "ipad", "cannon". I always having a lot
> > of
> > > results (~6). This is unacceptable for my client. He would like to
> > have
> > > then only the `good` results. That particulary match specific query. It's
> > > hard to acomplish for me cause of use doublemetaphone field which
> > converts
> > > words like "apt", "opt" and "ipad" and even "ipod" to the same phonetic
> > word
> > > - APT. And then all of these  words are matched fairly the same gives me
> > > huge amount of results. Similar problems I have with other words like
> > > "canon", "canine" and "cannon" which are KNN in phonetic way. But
> > lexically
> > > have different meanings: "canon" - camera, "canine" - cat food , "cannon"
> > -
> > > may be a misspell for canon or part of book title about cannon weapons.
> > >
> > > My first idea was to make a second requestHandler without searching in
> > > *_phonetic fields. And use it for queries with

Re: Too many results in dismax queries with one word

2011-08-21 Thread Erick Erickson
I think Sujit has hit the nail on the head. Any program you try to write
that tries to guess what the user *really* meant will require endless
tinkering and *still* won't be right. If you only knew how annoying I
find Google's attempts to "help".

So perhaps concentrating on some interaction with the user, who is,
after all, the only one who really knows what they want, is the best approach.

Best
Erick

On Sun, Aug 21, 2011 at 12:26 PM, Sujit Pal  wrote:
> Would it make sense to have a "Did you mean?" type of functionality for
> which you use the EdgeNGram and Metaphone filters /if/ you don't get
> appropriate results for the user query?
>
> So when user types "cannon" and the application notices that there are
> no cannons for sale in the index (0 results with standard analysis), it
> then makes another query with the EdgeNGram and/or Metaphone filters and
> come back with:
>
> Did you mean "Canon", "Canine"?
>
> Clicking on "Canon" or "Canine" would fire off a query for these terms.
>
> That way your application doesn't guess what is right, it goes back and
> asks the user what he wants.
>
> -sujit
>
> On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote:
>> Thanks for reply. I know that sometimes meeting all clients needs would be
>> impossible but then client recalls that competitive (commercial) product
>> already do that (but has other problems, like performance). And then I'm
>> obligated to try more tricks. :/
>>
>> I'm currently using Solr 3.1 but thinking about migrating to latest stable
>> version - 3.3.
>>
>> You correct, to meet client needs I have also used some hacks with boosting
>> queries (`bq` and `bf` parameters) but I omit that to make XMLs clearer.
>>
>> You mentioned faceting. This is also one of my(my client?) problems. In the
>> user interface they want to have 5 categories for products. Those 5 should
>> be most relevance ones. When I get those with highest counts for one word
>> queries they are most of the time "not that which should be there". For
>> example with phrase "ipad" which actually has only 12 most relevant products
>> in category "tablets" but phonetic APT matches also part of model name for
>> hundreds of UPS power supplies and bath tubes . And these are on the list,
>> not tablets. :/
>>
>> But you mentioned autocomplete which is something what I haven't watched
>> yet. I'll try with that and show it to my client.
>>
>> --
>> Rafał "RaVbaker" Piekarski.
>>
>> web: http://ja.ravbaker.net
>> mail: ravba...@gmail.com
>> jid/xmpp/aim: ravba...@gmail.com
>> mobile: +48-663-808-481
>>
>>
>> On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson 
>> wrote:
>>
>> > The root problem here is "This is unacceptable for my client". The first
>> > thing I'd suggest is that you work with your client and get them to define
>> > what is acceptable. You'll be forever changing things (to no good purpose)
>> > if all they can say is "that's not right".
>> >
>> > For instance, you apparently have two competing requirements:
>> > 1> try to correct users input, which inevitably increases the results
>> > returned
>> > 2> narrow the search to the "right" results.
>> >
>> > You can't have both every time!
>> >
>> > So you could try something like going with a more-restrictive search
>> > (no metaphone
>> > comparison) first and, if the results returned weren't sufficient
>> > firing the "broader" query
>> > back, without showing the too-small results first.
>> >
>> > You could work with your client and see if what they really want is
>> > just the most relevant
>> > results at the top of the list, in which case you can play with the
>> > dismax field boosts
>> > (by the way, what version of Solr are you using?)
>> >
>> > You could work with the client to understand the user experience if
>> > you use autocomplete
>> > and/or faceting etc. to guide their explorations.
>> >
>> > You could...
>> >
>> > But none of that will help unless and until you and your client can
>> > agree what is the
>> > correct behavior ahead of time
>> >
>> > Best
>> > Erick
>> >
>> > On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker)
>> >  wrote:
>> > > Hi all,
>> > >
>> > > I have a database of e-commerce products (5M) and trying to build a
>> > search
>> > > solution for it.
>> > >
>> > > I have used steemer, edgengram and doublemetaphone phonetic fields for
>> > > omiting common typos in queries.  It works quite good with dismax QParser
>> > > for queries longer than one word: "tv lc20", "sny psp 3001", "cannon 5d"
>> > > etc. For not having too many results I manipulated with `mm` parameter.
>> > But
>> > > when user type a single word like "ipad", "cannon". I always having a lot
>> > of
>> > > results (~6). This is unacceptable for my client. He would like to
>> > have
>> > > then only the `good` results. That particulary match specific query. It's
>> > > hard to acomplish for me cause of use doublemetaphone field which
>> > converts
>> > > words like "apt", "opt" and "ipad" and even

Re: Terms.regex performance issue

2011-08-21 Thread O. Klein
Yeah, I was searching infix. It worked very nicely for autocomplete.

Made a custom QueryConverter for the Suggester so it gives proper
suggestions for shingles. Will stick with that for now.

Thanx for the feedback.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3273145.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Too many results in dismax queries with one word

2011-08-21 Thread RaVbaker
Thanks very much for your advice. I think I now have a better understanding of how to
make better use of Solr. I have tested the spellchecker and it looks like it will let
me achieve better results, and hopefully we will satisfy the client.

In my solution I will change the user query to use or not use the phonetic fields,
based on the results from spellcheck.collation and word frequency. If I'm not
sure which is better, I'll ask the user through "did you mean"
and log the reply to make better choices in the future.

Once again thanks a lot guys.

This is an example of my query to the spellchecker:

http://localhost:8983/solr/select?spellcheck=true&q=cannon&rows=0&spellcheck.collate=true&spellcheck.count=10&spellcheck.onlyMorePopular=true&spellcheck.extendedResults=on

-- 
Rafał "RaVbaker" Piekarski.

web: http://ja.ravbaker.net
mail: ravba...@gmail.com
jid/xmpp/aim: ravba...@gmail.com
mobile: +48-663-808-481


On Sun, Aug 21, 2011 at 6:36 PM, Erick Erickson wrote:

> I think Sujit has hit the nail on the head. Any program you try to write
> that tries to guess what the user *really* meant will require endless
> tinkering and *still* won't be right. If you only knew how annoying I
> find Google's attempts to "help".
>
> So perhaps concentrating on some interaction with the user, who is,
> after all, the only one who really knows what they want is the best
> approach.
>
> Best
> Erick
>
> On Sun, Aug 21, 2011 at 12:26 PM, Sujit Pal  wrote:
> > Would it make sense to have a "Did you mean?" type of functionality for
> > which you use the EdgeNGram and Metaphone filters /if/ you don't get
> > appropriate results for the user query?
> >
> > So when user types "cannon" and the application notices that there are
> > no cannons for sale in the index (0 results with standard analysis), it
> > then makes another query with the EdgeNGram and/or Metaphone filters and
> > come back with:
> >
> > Did you mean "Canon", "Canine"?
> >
> > Clicking on "Canon" or "Canine" would fire off a query for these terms.
> >
> > That way your application doesn't guess what is right, it goes back and
> > asks the user what he wants.
> >
> > -sujit
> >
> > On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote:
> >> Thanks for reply. I know that sometimes meeting all clients needs would
> be
> >> impossible but then client recalls that competitive (commercial) product
> >> already do that (but has other problems, like performance). And then I'm
> >> obligated to try more tricks. :/
> >>
> >> I'm currently using Solr 3.1 but thinking about migrating to latest
> stable
> >> version - 3.3.
> >>
> >> You correct, to meet client needs I have also used some hacks with
> boosting
> >> queries (`bq` and `bf` parameters) but I omit that to make XMLs clearer.
> >>
> >> You mentioned faceting. This is also one of my(my client?) problems. In
> the
> >> user interface they want to have 5 categories for products. Those 5
> should
> >> be most relevance ones. When I get those with highest counts for one
> word
> >> queries they are most of the time "not that which should be there". For
> >> example with phrase "ipad" which actually has only 12 most relevant
> products
> >> in category "tablets" but phonetic APT matches also part of model name
> for
> >> hundreds of UPS power supplies and bath tubes . And these are on the
> list,
> >> not tablets. :/
> >>
> >> But you mentioned autocomplete which is something what I haven't watched
> >> yet. I'll try with that and show it to my client.
> >>
> >> --
> >> Rafał "RaVbaker" Piekarski.
> >>
> >> web: http://ja.ravbaker.net
> >> mail: ravba...@gmail.com
> >> jid/xmpp/aim: ravba...@gmail.com
> >> mobile: +48-663-808-481
> >>
> >>
> >> On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson <
> erickerick...@gmail.com>wrote:
> >>
> >> > The root problem here is "This is unacceptable for my client". The
> first
> >> > thing I'd suggest is that you work with your client and get them to
> define
> >> > what is acceptable. You'll be forever changing things (to no good
> purpose)
> >> > if all they can say is "that's not right".
> >> >
> >> > For instance, you apparently have two competing requirements:
> >> > 1> try to correct users input, which inevitably increases the results
> >> > returned
> >> > 2> narrow the search to the "right" results.
> >> >
> >> > You can't have both every time!
> >> >
> >> > So you could try something like going with a more-restrictive search
> >> > (no metaphone
> >> > comparison) first and, if the results returned weren't sufficient
> >> > firing the "broader" query
> >> > back, without showing the too-small results first.
> >> >
> >> > You could work with your client and see if what they really want is
> >> > just the most relevant
> >> > results at the top of the list, in which case you can play with the
> >> > dismax field boosts
> >> > (by the way, what version of Solr are you using?)
> >> >
> >> > You could work with the client to understand the user experience if
> >> > you use autocomplete
> >> > an

Re: Terms.regex performance issue

2011-08-21 Thread Erick Erickson
Ah, in that case, comparing prefix and regex is an apples-to-oranges
comparison. I expect regex to be slower, but a fairer comparison
would be prefix to stuff* (which may be changed into a prefix
enumeration for all I know). But comparing infix to prefix doesn't tell you
much, really.

Best
Erick

P.S. There's no reason to do anything if you have a solution that works
already though.

On Sun, Aug 21, 2011 at 12:56 PM, O. Klein  wrote:
> Yeah, I was searching infix. It worked very nice for autocomplete.
>
> Made a custom QueryConverter for the Suggester so it gives proper
> suggestions for shingles. Will stick with that for now.
>
> Thanx for the feedback.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3273145.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Terms.regex performance issue

2011-08-21 Thread O. Klein
Of course. That's why I compared prefix to bla* and saw it was already a lot
slower.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3273370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Terms.regex performance issue

2011-08-21 Thread O. Klein
I see now in the Suggester wiki: support for infix-suggestions is planned for
FSTLookup (which would be the only structure to support these).


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3273711.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Requiring multiple matches of a term

2011-08-21 Thread Michael Ryan
> One simple way of doing this is maybe to write a wrapper for TermQuery
> that only returns docs with a Term Frequency  > X as far as I
> understand the question those terms don't have to be within a certain
> window right?

Correct. Terms can be anywhere in the document. I figured term frequencies 
might be involved, but wasn't sure how to actually do this.

> Hmmm... i would think the phrase query approach should work, but it's
> totally possible that there's something odd in the way phrase queries
> work that could cause a problem -- the best way to sanity test something
> like this is to try a really small self contained example that you can post
> for other people to try.

I've been able to reduce it pretty far, but I don't have a totally 
self-contained example yet. I haven't tried it out yet on a stock build of Solr 
(I'm using 3.2 with various patches). Right now I'm inserting a few documents 
with a text field that contains "dog dog dog", then repeatedly running q="dog 
dog dog dog"~1 with the queryResultCache disabled. The query is not giving me 
the same results each time (!!!). Sometimes all the documents are returned, 
sometimes a subset is returned, and sometimes no documents are returned.

So far I've traced it down to the "repeats" array in 
SloppyPhraseScorer.initPhrasePositions() - depending on the order of the 
elements in this array, the document may or may not match. I think the 
HashSet.toArray() call is to blame here, but I don't yet fully understand the 
expected behavior of the initPhrasePositions function...

-Michael


Solr Join with multiple query parameters

2011-08-21 Thread Cameron Hurst

Hi all,

I am trying to use the Join feature in Solr trunk with limited success. I
am able to make simple searches and get documents returned as
expected. A query such as the following works perfectly fine and as expected:


http://localhost:8983/solr/core0/select?q={!join%20from=matchset_id_ss%20to=id}*:*

I can then add parameters to this search and make the search as follows, 
and it works fine.

http://localhost:8983/solr/core0/select?q={!join%20from=matchset_id_ss%20to=id}*:*&fq=status_s:completed

I get filtered results of documents that are completed. The issue I am
now facing is how to filter the initial set of documents
based on multiple conditions and then get a list of documents through
the join. Here is the search I am trying to do:


http://localhost:8983/solr/core0/select?start=0&q=*:*&fq=status_i:1&rows=30&fq=team_id_i:1223

This search returns everything I want as expected; now I want to apply
the join statement. I have added the join statement above to this URL in
every place I can think of, but it seems that the join
statement takes place before any of the other filters are applied. The
issue is that the returned documents mapped by the
matchset_id_ss field do not have the fields status_i or team_id_i - these only
exist on the initial documents I am searching.


Is there a way that I can apply multiple filters first, then complete 
the join? And if that is possible, can I then add more filters after the 
join?


Thanks for the help,

Cameron



Re: How to implement Spell Checker using Solr?

2011-08-21 Thread anupamxyz
The changes to solrconfig.xml in Solr are as follows:



  
  default
  
  solr.IndexBasedSpellChecker
  
  spell
  
  ./spellchecker
  
  0.7
  
  .0001



  jarowinkler
  lowerfilt
  
  org.apache.lucene.search.spell.JaroWinklerDistance
  ./spellchecker




textSpell


And for the Request handler, I have incorporated the following changes:




  
  true
  
  false
  
  default
  
  false
  
  5
true
  true


  spellcheck

  

The same is failing while crawling. I have reverted my code for now, but I can
try it once again and post the exception that I have been getting while
crawling.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-implement-Spell-Checker-using-Solr-tp3268450p3274069.html
Sent from the Solr - User mailing list archive at Nabble.com.