Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-22 Thread Bing Li
Dear Shashi,

Thanks so much for your reply!

However, I think the PageRank value is not static; it must be updated
on the fly. As far as I know, a Lucene index is not well suited to very
frequent updates. If so, how should I deal with that?

Best regards,
Bing


On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant  wrote:

> Lucene has a mechanism to "boost" up/down documents using your custom
> ranking algorithm. So if you come up with something like Pagerank
> you might do something like doc.SetBoost(myboost), before writing to index.
>
>
>
> On Sat, Jan 21, 2012 at 5:07 PM, Bing Li  wrote:
> > Hi, Kai,
> >
> > Thanks so much for your reply!
> >
> > If retrieval is done on a string field rather than a text field, a
> > complete-matching approach should be used, according to my understanding,
> > right? If so, how does Lucene rank the retrieved data?
> >
> > Best regards,
> > Bing
> >
> > On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu  wrote:
> >
> >> Solr handles the retrieval step; you can customize the scoring formula in
> >> Lucene. It should not be too complicated, though; it is better if it can
> >> be factorized. It also depends on the stored information, such as TF, DF,
> >> position, etc. You can do a second-phase rerank on the top N results you
> >> have retrieved.
> >>
> >> Sent from my iPad
> >>
> >> On Jan 21, 2012, at 1:33 PM, Bing Li  wrote:
> >>
> >> > Dear all,
> >> >
> >> > I am using SolrJ to implement a system that needs to provide users
> >> > with search services. I have some questions about Solr searching, as
> >> > follows.
> >> >
> >> > As far as I know, Lucene retrieves data according to the degree of
> >> > keyword matching on a text field (partial matching).
> >> >
> >> > But if I search data by a string field (complete matching), how does
> >> > Lucene sort the retrieved data?
> >> >
> >> > If I want to add new sorting methods, Solr's function query seems to
> >> > support this feature.
> >> >
> >> > However, for a complicated ranking strategy, such as PageRank, can
> >> > Solr provide an interface for me to do that?
> >> >
> >> > My ranking methods are more complicated than PageRank. Now I have to
> >> > load all of the matched data from Solr by keyword first and rank it
> >> > again in my own way before showing it to users. Is that correct?
> >> >
> >> > Thanks so much!
> >> > Bing
> >>
>
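Kai's second-phase rerank suggestion in the thread above also addresses the freshness concern: leave the index (and any index-time boosts) alone, and combine Solr's relevance score with the current PageRank-like value at query time. A minimal sketch; the weighting scheme and score sources are illustrative assumptions, not Solr API:

```python
def rerank(top_docs, rank_scores, alpha=0.7):
    """Combine Solr's relevance score with an externally maintained,
    frequently updated PageRank-like score, then re-sort the top N.

    top_docs:    list of (doc_id, solr_score) from the first retrieval phase
    rank_scores: dict doc_id -> current PageRank-like value
    alpha:       weight given to the Solr score (illustrative choice)
    """
    def combined(item):
        doc_id, solr_score = item
        return alpha * solr_score + (1 - alpha) * rank_scores.get(doc_id, 0.0)
    return sorted(top_docs, key=combined, reverse=True)
```

Because the PageRank-like values live outside the index, they can be updated as often as needed without reindexing.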


Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

James,

I worked out that I actually needed to 'apply' patch SOLR-2585, 
whoops. So I have done that now and it seems to return 
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even 
in the dictionary). Could something have changed in the trunk to make 
your patch no longer work? I had to manually merge the setup for the 
test case due to a new 'hyphens' test case. The settings I am using are:



explicit
10

false
10
true
true
true
10
1

5
1




<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">spell</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <float name="accuracy">0.5</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="maxInspections">5</int>
  <int name="minQueryLength">4</int>
  <float name="thresholdTokenFrequency">0.01</float>
</lst>



spellchecker
true


With the query:

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5

Cheers,

David
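For reference, a long query string like the one above can be assembled from a parameter dict instead of by hand. This sketch only builds the URL and does not contact Solr; the endpoint path is an assumed example:

```python
from urllib.parse import urlencode

def build_spellcheck_url(base, q, rows=5):
    """Assemble a Solr select URL with spellcheck enabled.
    `base` is the core's /select endpoint (assumed, e.g.
    http://localhost:8983/solr/select)."""
    params = {
        "q": q,
        "spellcheck": "true",
        "spellcheck.q": q,  # check the raw user terms, not the boosted query
        "rows": rows,
    }
    return base + "?" + urlencode(params)
```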


On 22/01/2012 2:03 AM, David Radunz wrote:

James,

Thanks again for your lengthy and informative response. I updated 
from SVN trunk again today and was successfully able to run 'ant 
test'. So I proceeded with trying your suggestions (for question 1 so 
far):


On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in your 
index.  So even if "wever" is misspelled in context, if it exists in 
the index the spell checker will not try correcting it.  There are 3 
workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x 
only).  See https://issues.apache.org/jira/browse/SOLR-2585
I have tried using this with the original test case of 'Signorney 
Wever'. I didn't notice any difference, although I am a little unclear 
as to what exactly this patch does. Nor am I really clear what to set 
either of the options to, so I set them both to '5'. I tried to find 
the test case it mentions, but it's not present in 
SpellCheckCollatorTest.java .. Any suggestions?


2. try "onlyMorePopular=true" in your request.  
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  
But see the September 2, 2011 comment in SOLR-2585 about why this 
might not do what you'd hope it would.


Trying this did produce 'Signourney Weaver' as you would hope, but I 
am a little afraid of the downside. I would much prefer a context-sensitive
spell check that considers the terms around the correction.


3. If you're building your index on a <copyField>, you can add a
stopword filter that filters out all of the misspelt or rare words
from the field that the dictionary is based on.  This could be an
arduous task, and it may or may not work well for your data.
I am currently using a copyField for all of the relevant terms, which is
quite a lot, and the dictionary would encompass a huge amount of data.
Adding stopword filters would be out of the question, as we presently have
more than 30,000 products, and that is just for the initial launch; we
intend to have many, many more.


As for your second question, I take it you're using (e)dismax with 
multiple fields in "qf", right?  The only way I know to handle this 
is to create a <copyField> that combines all of the fields you search
across.  Use this combined field to base your dictionary on.  Also,
specifying "spellcheck.maxCollationTries" with a non-zero value will 
weed out the nonsense word combinations that are likely to occur when 
doing this, ensuring that any collations provided will indeed yield 
hits.  The downside to doing this, of course, is it will make your 
first problem more acute in that there will be even more terms in 
your index that the spellchecker will ignore entirely, even if 
they're misspelled in context.  Once again, SOLR-2585 is designed to
tackle this problem but it is still in its early stages, and thus far 
it is Trunk-only.
I tried setting spellcheck.maxCollationTries to 5 to see if it would 
help with the above problem, but it did not.


I have now tried using it in the context of question 2. I tried
searching for 'Sigorney Wever' in the series name (which it is not
present in, as it's an actor):


spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5 



Suggestions for 'Sigourney Wever' were returned, but no spelling
suggestions for series names (which I doubt there would be) were
returned.




You might also be interested in 
https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is 
unrelated to your two questions, the patch on this issue introduces a 
new "ConjunctionSolrSpellChecker" which theoretically could be 
enhanced to do exactly what you want.  That is, you could 
(theoretically) create separate dictionaries for each of the fiel

RE: Tika0.10 language identifier in Solr3.5.0

2012-01-22 Thread nibing

Hi,  This is exactly what I hope you can elaborate on: an analyzer that detects
the language and then analyzes accordingly. How do I do that?  Thank you.

Best Regards
Ni, Bing  

 > From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 09:15:30 -0800
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
> 
> I think you misunderstood what I am suggesting.
> 
> I am suggesting an analyzer that detects the language and then "does the
> right thing" according to the language it finds.   As such, it would
> tokenize and stem English according to English rules, German by German
> rules and would probably do a sliding bigram window in Japanese and Chinese.
> 
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson 
> wrote:
> 
> > bq: Why not have a polyglot analyzer
> >
> > That could work, but it makes some compromises and assumes that your
> > languages are "close enough"; I have absolutely no clue how that would
> > work for, say, English and Chinese.
> >
> > But it also introduces inconsistencies. Take stemming. Even though you
> > could easily stem in the correct language, throwing all those stems
> > into the same field can produce interesting results at search time since
> > you run the risk of hitting something produced by one of the other
> > analysis chains.
> >
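Ted's idea of one analyzer that branches on the detected language can be illustrated with a toy tokenizer. The script detection below is a crude Unicode-range check standing in for a real language identifier such as Tika's, and real analyzers would also stem per language:

```python
def polyglot_tokens(text):
    """Toy 'polyglot' tokenizer: CJK runs become overlapping bigrams
    (the sliding window Ted mentions), everything else is lowercased
    whitespace tokens."""
    def is_cjk(ch):
        return "\u4e00" <= ch <= "\u9fff"  # CJK Unified Ideographs block
    tokens = []
    for word in text.split():
        if word and all(is_cjk(c) for c in word):
            if len(word) == 1:
                tokens.append(word)
            else:
                tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
        else:
            tokens.append(word.lower())
    return tokens
```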
  

Re: Phonetic search for portuguese

2012-01-22 Thread Anderson vasconcelos
Anyone could help?

Thanks

2012/1/20, Anderson vasconcelos :
> Hi
>
> Are the phonetic filters (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex,
> Caverphone) only for the English language, or do they work for other
> languages? Is there a phonetic filter for Portuguese? If not, how can I
> implement one?
>
> Thanks
>


Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

Hey James,

I have played around a bit more with the settings and tried setting 
spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This 
yields 'Sigourney Weaver' as ONE of the corrections, but it's the second
one and not the first, which is wrong if this is a patch for context
sensitivity, because it doesn't really seem to honor any context at all.
Unless I am misunderstanding this? Also, I don't really like
maxResultsForSuggest, as it means 'all or nothing': if you set it to 10
and there are 100 results, then you offer no corrections at all, even if
the term is missing from the dictionary entirely.


If I set spellcheck.maxResultsForSuggest=100 and 
spellcheck.maxCollations=3 and choose the collation with the largest 
'hits' I get Sigourney Weaver and other 'popular' terms. But say I 
searched for 'pork and chups', the 'popular' correction is 'park and
chips', whereas the first correction was correct: 'pork and chips'.


So really, none of the solutions, either in this patch or in Solr, offer
what I would truly call context-sensitive spell checking. That is,
in a full-text search engine you find documents based on terms and how
close they are together in the document. It makes more than perfect
sense to treat the dictionary the same way, so that when there are multiple
terms, it offers suggestions for the terms that most closely match what is
entered surrounding the term.


Example:

"Sigourney Wever" would never appear in a document ever.
"Sigourney Weaver" however has many 'hits' in exactly that order of 
words.


So there needs to be a way to boost suggestions based on adjacency,
much as the full-text search itself operates.


Thoughts?

David
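David's adjacency idea amounts to scoring each candidate correction by how often the corrected phrase actually occurs as a phrase in the corpus. A toy version over in-memory documents; the phrase-count scoring is an illustration, not how Solr's spellchecker works (spellcheck.maxCollationTries does something related by test-running collations against the index):

```python
def best_collation(collations, corpus):
    """Pick the candidate correction whose exact phrase occurs most
    often in the corpus, a stand-in for 'context sensitive' ranking.

    collations: list of candidate correction strings
    corpus:     list of document strings
    """
    def phrase_hits(phrase):
        return sum(doc.lower().count(phrase.lower()) for doc in corpus)
    return max(collations, key=phrase_hits)
```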

On 22/01/2012 9:56 PM, David Radunz wrote:

James,

I worked out that I actually needed to 'apply' patch SOLR-2585, 
whoops. So I have done that now and it seems to return 
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even 
in the dictionary). Could something have changed in the trunk to make 
your patch no longer work? I had to manually merge the setup for the 
test case due to a new 'hyphens' test case. The settings I am using are:



explicit
10

false
10
true
true
true
10
1

5
1




<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">spell</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <float name="accuracy">0.5</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="maxInspections">5</int>
  <int name="minQueryLength">4</int>
  <float name="thresholdTokenFrequency">0.01</float>
</lst>



spellchecker
true


With the query:

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 



Cheers,

David


On 22/01/2012 2:03 AM, David Radunz wrote:

James,

Thanks again for your lengthy and informative response. I updated 
from SVN trunk again today and was successfully able to run 'ant 
test'. So I proceeded with trying your suggestions (for question 1 so 
far):


On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in 
your index.  So even if "wever" is misspelled in context, if it 
exists in the index the spell checker will not try correcting it.  
There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x 
only).  See https://issues.apache.org/jira/browse/SOLR-2585
I have tried using this with the original test case of 'Signorney 
Wever'. I didn't notice any difference, although I am a little 
unclear as to what exactly this patch does. Nor am I really clear 
what to set either of the options to, so I set them both to '5'. I 
tried to find the test case it mentions, but it's not present in 
SpellCheckCollatorTest.java .. Any suggestions?


2. try "onlyMorePopular=true" in your request.  
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  
But see the September 2, 2011 comment in SOLR-2585 about why this 
might not do what you'd hope it would.


Trying this did produce 'Signourney Weaver' as you would hope, but I 
am a little afraid of the downside. I would much prefer a
context-sensitive spell check that considers the terms around the correction.


3. If you're building your index on a <copyField>, you can add a
stopword filter that filters out all of the misspelt or rare words
from the field that the dictionary is based on.  This could be an
arduous task, and it may or may not work well for your data.
I am currently using a copyField for all of the relevant terms, which is
quite a lot, and the dictionary would encompass a huge amount of data.
Adding stopword filters would be out of the question, as we presently have
more than 30,000 products, and that is just for the initial launch; we
intend to have many, many more.


As for your second question, I take it you're using (e)dismax with 
multiple fields in "qf", right?  The only way I know to handle this 
is to create a <copyField> that combines all of the fields you
search across.  Use this combined field to base your dictionary on.

Re: Phonetic search for portuguese

2012-01-22 Thread Gora Mohanty
On Sun, Jan 22, 2012 at 5:47 PM, Anderson vasconcelos
 wrote:
> Anyone could help?
>
> Thanks
>
> 2012/1/20, Anderson vasconcelos :
>> Hi
>>
>> Are the phonetic filters (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex,
>> Caverphone) only for the English language, or do they work for other
>> languages? Is there a phonetic filter for Portuguese? If not, how can I
>> implement one?

We did this, in another context, by using the open-source aspell library to
handle the spell-checking for us. This has distinct advantages as aspell
is well-tested, handles soundslike in a better manner at least IMHO, and
supports a wide variety of languages, including Portuguese.

There are some drawbacks, as aspell only has C/C++ interfaces, and
hence we built bindings on top of SWIG. Also, we handled the integration
with Solr via a custom filter factory, though there are better ways to do this.
Such a project would thus have dependencies on aspell, and our custom
code. If there is interest in this, we would be happy to open source this
code: Given our current schedule this could take 2-3 weeks.

Regards,
Gora
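Absent aspell, the flavor of a language-specific phonetic key can be shown with a few Portuguese-oriented rewrite rules. The rules below are rough assumptions for illustration only and are nowhere near aspell's quality:

```python
import unicodedata

def pt_phonetic_key(word):
    """Crude Portuguese phonetic key: map cedilla to 's', strip accents,
    then collapse a few graphemes that sound alike (lh, nh, ch, ss, z).
    Illustrative only; a production filter needs far more rules."""
    w = word.lower().replace("ç", "s")
    # Decompose and drop combining marks (ã -> a, é -> e, ...)
    w = unicodedata.normalize("NFD", w)
    w = "".join(c for c in w if unicodedata.category(c) != "Mn")
    for old, new in [("lh", "l"), ("nh", "n"), ("ch", "x"),
                     ("ss", "s"), ("z", "s")]:
        w = w.replace(old, new)
    return w
```

In Solr this kind of key function would sit inside a custom TokenFilter, emitting the key alongside (or instead of) the original token, analogous to the existing phonetic filters.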


Re: How to Sort in a Different Way

2012-01-22 Thread yunfei wu
What kind of new sorting do you want?

If you want to change Lucene's relevance score for results, you may
play with boosting. If you just want to sort on fields, you can use
"sort=fieldname" to sort on string, integer, or date fields.

Yunfei


On Sat, Jan 21, 2012 at 8:39 AM, Bing Li  wrote:

> Dear all,
>
> I have a question when sorting retrieved data from Solr. As I know, Lucene
> retrieves data according to the degree of keyword matching on text field
> (partial matching).
>
> If I search data by string field (complete matching), how does Lucene sort
> the retrieved data?
>
> If I want to add new sorting methods, how do I do that? Now I have to load
> all of the matched data from Solr and rank it again in my own way before
> showing it to users. Is that correct?
>
> Thanks so much!
> Bing
>


Re: "index-time" over boosted

2012-01-22 Thread remi tassing
Hi,

I got it wrong in the beginning by putting omitNorms in the query URL.

Now, following your advice, I merged the schema.xml from Nutch and Solr and
made sure omitNorms was set to "true" for the content field, just as you said.

Unfortunately the problem remains :-(

On Thursday, January 19, 2012, Jan Høydahl  wrote:
> Hi,
>
> The schema you pasted in your mail is NOT Solr3.5's default example
schema. Did you get it from the Nutch project?
>
> And the "omitNorms" parameter is supposed to go in the  tag in
schema.xml, and the "content" field in the example schema does not have
omitNorms="true". Try to change
>
>   
> to
>   
>
> and try again. Please note that you SHOULD customize your schema, there
is really no "default" schema in Solr (or Nutch), it's only an example or
starting point. For your search application to work well you will have to
invest some time in designing a schema, working with your queries, perhaps
exploring DisMax query parser etc etc.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 19. jan. 2012, at 13:01, remi tassing wrote:
>
>> Hello Jan,
>>
>> My schema wasn't changed from the release 3.5.0. The content can be seen
>> below:
>>
>> 
>>
>>>sortMissingLast="true" omitNorms="true"/>
>>>omitNorms="true"/>
>>>omitNorms="true"/>
>>>positionIncrementGap="100">
>>
>>
>>>ignoreCase="true" words="stopwords.txt"/>
>>>generateWordParts="1" generateNumberParts="1"
>>catenateWords="1" catenateNumbers="1" catenateAll="0"
>>splitOnCaseChange="1"/>
>>
>>>protected="protwords.txt"/>
>>
>>
>>
>>>positionIncrementGap="100">
>>
>>
>>
>>>generateWordParts="1" generateNumberParts="1"/>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
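The change Jan describes is to set omitNorms="true" directly on the content field definition in schema.xml. A sketch of such a field declaration (the surrounding attribute set in the Nutch schema may differ):

```xml
<field name="content" type="text" stored="true" indexed="true" omitNorms="true"/>
```

With norms omitted, field length no longer influences scoring for that field, which is the usual remedy for "index-time over boosted" short documents.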

Re: Trying to understand SOLR memory requirements

2012-01-22 Thread Dave
I take it from the overwhelming silence on the list that what I've asked is
not possible? It seems like the suggester component is not well supported
or understood, and limited in functionality.

Does anyone have any ideas for how I would implement the functionality I'm
looking for? I'm trying to implement a single location auto-suggestion box
that will search across multiple DB tables. It would take several possible
inputs: city, state, country; state,county; or country. In addition, there
are many aliases for each city, state and country that map back to the
original city/state/country. Once they select a suggestion, that suggestion
needs to have certain information associated with it. It seems that the
Suggester component is not the right tool for this. Anyone have other ideas?

Thanks,
Dave
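Robert's advice in the quoted thread below, keeping the suggestion text as the searchable term and returning the metadata from separate stored fields, can be modeled with a toy prefix suggester. The record layout here is an assumption for illustration:

```python
def suggest(prefix, records, limit=5):
    """Toy autosuggest: match on the display name only; the associated
    metadata (ids, timezone, ...) rides along as a stored payload
    instead of being packed into the searchable term itself.

    records: list of dicts, each with at least a 'name' key
    """
    p = prefix.lower()
    hits = [r for r in records if r["name"].lower().startswith(p)]
    return hits[:limit]
```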

On Thu, Jan 19, 2012 at 6:09 PM, Dave  wrote:

> That was how I originally tried to implement it, but I could not figure
> out how to get the suggester to return anything but the suggestion. How do
> you do that?
>
>
> On Thu, Jan 19, 2012 at 1:13 PM, Robert Muir  wrote:
>
>> I really don't think you should put a huge json document as a search term.
>>
>> Just make "Brooklyn, New York, United States" or whatever you intend
>> the user to actually search on/type in as your search term.
>> put the rest in different fields (e.g. stored-only, not even indexed
>> if you dont need that) and have solr return it that way.
>>
>> On Thu, Jan 19, 2012 at 12:31 PM, Dave  wrote:
>> > In my original post I included one of my terms:
>> >
>> > Brooklyn, New York, United States?{ |id|: |2620829|,
>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
>> United
>> > States| }
>> >
>> > I'm matching on the first part of the term (the part before the ?), and
>> > then the rest is being passed via JSON into Javascript, then converted
>> to a
>> > JSON term itself. Here is my data-config.xml file, in case it sheds any
>> > light:
>> >
>> > 
>> >  > >  driver="com.mysql.jdbc.Driver"
>> >  url=""
>> >  user=""
>> >  password=""
>> >  encoding="UTF-8"/>
>> >  
>> >> >pk="id"
>> >query="select p.id as placeid, c.id, c.plainname, c.name,
>> > p.timezone from countries c, places p where p.regionid = 1 AND p.cityid
>> = 1
>> > AND c.id=p.countryid AND p.settingid=1"
>> >transformer="TemplateTransformer">
>> >
>> >
>> >
>> >
>> >
>> >> template="${countries.plainname}?{
>> > |id|: |${countries.placeid}|, |timezone|:|${countries.timezone}|,|type|:
>> > |1|, |country|: { |id| : |${countries.id}|, |plainname|:
>> > |${countries.plainname}|, |name|: |${countries.plainname}| }, |region|:
>> {
>> > |id| : |0| }, |city|: { |id|: |0| }, |hint|: ||, |label|:
>> > |${countries.plainname}|, |value|: |${countries.plainname}|, |title|:
>> > |${countries.plainname}| }"/>
>> >
>> >> >pk="id"
>> >query="select p.id as placeid, p.countryid as countryid,
>> > c.plainname as countryname, p.timezone as timezone, r.id as regionid,
>> > r.plainname as regionname, r.population as regionpop from places p,
>> regions
>> > r, countries c where r.id = p.regionid AND p.settingid = 1 AND
>> p.regionid >
>> > 1 AND p.countryid=c.id AND p.cityid=1 AND r.population > 0"
>> >transformer="TemplateTransformer">
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >> >pk="id"
>> >query="select c2.id as cityid, c2.plainname as cityname,
>> > c2.population as citypop, p.id as placeid, p.countryid as countryid,
>> > c.plainname as countryname, p.timezone as timezone, r.id as regionid,
>> > r.plainname as regionname from places p, regions r, countries c, cities
>> c2
>> > where c2.id = p.cityid AND p.settingid = 1 AND p.regionid > 1 AND
>> > p.countryid=c.id AND r.id=p.regionid"
>> >transformer="TemplateTransformer">
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >  
>> > 
>> >
>> >
>> >
>> >
>> > On Thu, Jan 19, 2012 at 11:52 AM, Robert Muir  wrote:
>> >
>> >> I don't think the problem is FST, since it sorts offline in your case.
>> >>
>> >> More importantly, what are you trying to put into the FST?
>> >>
>> >> it appears you are indexing terms from your term dictionary, but your
>> >> term dictionary is over 1GB, why is that?
>> >>
>> >> what do your terms look like? 1GB for 2,784,937 documents does not make
>> >> sense.
>> >> for ex

facet pivot and range

2012-01-22 Thread Antoine LE FLOC'H
Hello,

I can't find anything related to what I would like to do: a facet.pivot that
has ranges on the second level, something like

facet.pivot=cat,price

where price is a range facet

facet.range=price&facet.range.start=0&facet.range.end=1000&facet.range.gap=10

Is this doable with Solr 4? How would you do it?

Thanks so much
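Until pivot facets can nest range facets natively, one workaround is issuing a range-facet query per category, or bucketing client-side from returned (cat, price) pairs. A sketch of the client-side version (pure illustration, not a Solr feature):

```python
def pivot_range(rows, start, end, gap):
    """Bucket (category, price) rows into per-category price ranges,
    mimicking facet.pivot=cat,price with a range facet on the second
    level. Returns {category: {range_start: count}}."""
    buckets = {}
    for cat, price in rows:
        if not (start <= price < end):
            continue  # mirrors facet.range.start / facet.range.end
        lo = start + gap * int((price - start) // gap)
        buckets.setdefault(cat, {}).setdefault(lo, 0)
        buckets[cat][lo] += 1
    return buckets
```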


Re: Phonetic search for portuguese

2012-01-22 Thread Anderson vasconcelos
Hi Gora, thanks for the reply.

I'm interested in seeing how you did this solution. But my time is
short, and I need to create a solution for my client soon. If
anyone knows some other simple and fast solution, please post it on this
thread.

Gora, could you explain how you implemented the custom filter factory and
how you used it in Solr?

Thanks


2012/1/22, Gora Mohanty :
> On Sun, Jan 22, 2012 at 5:47 PM, Anderson vasconcelos
>  wrote:
>> Anyone could help?
>>
>> Thanks
>>
>> 2012/1/20, Anderson vasconcelos :
>>> Hi
>>>
>>> Are the phonetic filters (DoubleMetaphone, Metaphone, Soundex,
>>> RefinedSoundex, Caverphone) only for the English language, or do they
>>> work for other languages? Is there a phonetic filter for Portuguese?
>>> If not, how can I implement one?
>
> We did this, in another context, by using the open-source aspell library to
> handle the spell-checking for us. This has distinct advantages as aspell
> is well-tested, handles soundslike in a better manner at least IMHO, and
> supports a wide variety of languages, including Portuguese.
>
> There are some drawbacks, as aspell only has C/C++ interfaces, and
> hence we built bindings on top of SWIG. Also, we handled the integration
> with Solr via a custom filter factory, though there are better ways to do
> this.
> Such a project would thus have dependencies on aspell, and our custom
> code. If there is interest in this, we would be happy to open source this
> code: Given our current schedule this could take 2-3 weeks.
>
> Regards,
> Gora
>


Re: Sort for Retrieved Data

2012-01-22 Thread Erick Erickson
See below.

On Fri, Jan 20, 2012 at 10:42 AM, Bing Li  wrote:
> Dear all,
>
> I have a question when sorting retrieved data from Solr. As I know, Lucene
> retrieves data according to the degree of keyword matching on text field
> (partial matching).
>
> If I search data by string field (complete matching), how does Lucene sort
> the retrieved data?
>
If scores match exactly, which may well be the case here, then the tiebreaker
is the internal Lucene document id.

> If I add some filters, such as time, what about the sorting way?
>
It doesn't. Filters only restrict the result set; they have no influence
on sorting.

> If I just need the top ones, is it proper to just add rows?
>
I don't understand what you're asking. If you want the top 100
rather than the top 10, yes you can increase the &rows
parameter or page (see &start).

> If I want to add new sorting methods, how do I do that?

See the &sort parameter. This page comes up as the
first google search on "solr sort"
http://lucene.apache.org/solr/tutorial.html

Best
Erick
>
> Thanks so much!
> Bing
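Erick's point about ties can be modeled directly: results are ordered by descending score, and exact score ties fall back to ascending internal document id:

```python
def lucene_order(results):
    """Order results by descending score, breaking exact score ties by
    ascending internal document id, Lucene's default tiebreaker.

    results: list of (docid, score) tuples
    """
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

On a string (exact-match) field many hits share the same score, so in practice the internal id, roughly insertion order, dominates unless an explicit &sort is given.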


Re: Getting a word count frequency out of a page field

2012-01-22 Thread Erick Erickson
Faceting won't work at all. Its function is to return the count
of the *documents* that a value occurs in, so that's no good
for your use case.

"I don't know how to issue a proper SOLR query that returns a word count for
a paragraph of text such as the term "amplifier" for a field. For some
reason it only returns."

This is really unclear. Are you asking for the word counts of a paragraph
that contains "amplifier"? The number of times "amplifier" appears in
a paragraph? In a document?

And why do you want this information anyway? It might be an XY problem.

Best
Erick

On Fri, Jan 20, 2012 at 1:06 PM, solr user  wrote:
> SOLR reports the term occurrence for terms over all the documents. I am
> having trouble making a query that returns the term occurrence in a
> specific page field called, documentPageId.
>
> I don't know how to issue a proper SOLR query that returns a word count for
> a paragraph of text such as the term "amplifier" for a field. For some
> reason it only returns.
>
> The things I've tried only return a count for 1 occurrence of the term even
> though I see the term in the paragraph more than just once.
>
> I've tried faceting on the field, "contents"
>
> http://localhost:8983/solr/select?indent=on&q=*:*&wt=standard&facet=on&facet.field=documentPageId&facet.query=amplifier&facet.sort=lex&facet.missing=on&facet.method=count
>
> 
> 
> 21
> 
> 
> 
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 0
> 
> 
> 
> 
> 
> 
>
>
> In schema.xml:
>   indexed="true" />
>   multiValued="false"/>
>
> In solrconfig.xml:
>
>       filewrapper
>       caseNumber
>       pageNumber
>       documentId
>       contents
>       documentId
>       caseNumber
>       pageNumber
>      documentPageId
>       contents
>
> Thanks in advance,


Re: Validating solr user query

2012-01-22 Thread Erick Erickson
Good luck on that 

If you allow free-form input, bad queries are just going to happen. To prevent
this from getting to Solr, you essentially have to reproduce the
entire Solr/Lucene
parser. So why not just let the parser to it for you and present some pretty
message to the user?

The other thing you can do is build your own "advanced query page" that guides
the user through adding parentheses, ands, ors, nots, fuzzy, all that
jazz, but that's
often really painful to do.

But other than making a UI that makes it difficult to make bad queries
or parsing
the query, you're pretty much stuck...

Best
Erick
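As Erick says, the real validation should be left to the query parser. If a cheap client-side sanity check is still wanted before sending the query, something like this catches only the most obvious problems; it is emphatically not a Lucene parser:

```python
def looks_sane(query):
    """Very shallow pre-check for a user query string: balanced
    parentheses and an even number of double quotes. Anything subtler
    (field syntax, boolean operators, fuzzy/proximity suffixes) should
    be left to the real query parser's error reporting."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing paren with no matching open
    return depth == 0 and query.count('"') % 2 == 0
```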

On Fri, Jan 20, 2012 at 2:52 PM, Dipti Srivastava
 wrote:
> Hi All,
I am using HTTP/JSON to search my documents in Solr. Now the client provides 
> the query on which the search is based.
> What is a good way to validate the query string provided by the user?
>
> On the other hand, if I want the user to build this query using some Solr API
> instead of preparing a Lucene query string, which API can I use for this?
> I looked into
> SolrQuery in SolrJ but it does not appear to have a way to specify the more 
> complex queries with the boolean operators and operators such as ~,+,- etc.
>
> Basically, I am trying to avoid running into bad query strings built by the 
> caller.
>
> Thanks!
> Dipti
>
> 
> This message is private and confidential. If you have received it in error, 
> please notify the sender and remove it from your system.
>


Re: Getting a word count frequency out of a page field

2012-01-22 Thread solr user
See comments inline below.

On Sun, Jan 22, 2012 at 8:27 PM, Erick Erickson wrote:

> Faceting won't work at all. Its function is to return the count
> of the *documents* that a value occurs in, so that's no good
> for your use case.
>
> "I don't know how to issue a proper SOLR query that returns a word count
> for
> a paragraph of text such as the term "amplifier" for a field. For some
> reason it only returns."
>
> This is really unclear. Are you asking for the word counts of a paragraph
> that contains "amplifier"? The number of times "amplifier" appears in
> a paragraph? In a document?
>

I'm looking for the number of times the word or term appears in a paragraph
that I'm indexing as the field name "contents". I'm storing and indexing
the field name "contents" that contains multiple occurrences of the
term/word. However, when I query for that term, it reports that the
word/term appeared only once in the field "contents".


>
> And why do you want this information anyway? It might be an XY problem.
>

I want to be able to search for word frequency for a page in a document
that has many pages. So I can report to the user that the term/word
occurred on page 1 "10" times. The user can click on the result and go
right to the page where the word/term appeared most frequently.

What do you mean by an XY problem?
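One fallback for what is wanted here, per-page term frequency, is to count occurrences in each page's stored contents field client-side. The whitespace tokenization below is an assumption; a real version should mirror the field's analyzer:

```python
def term_counts(pages, term):
    """Count occurrences of `term` per page.

    pages: dict mapping a page id (e.g. documentPageId) to the stored
           contents string for that page
    term:  the word to count, matched case-insensitively
    """
    t = term.lower()
    return {pid: sum(1 for w in text.lower().split() if w == t)
            for pid, text in pages.items()}
```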



>
> Best
> Erick
>
> On Fri, Jan 20, 2012 at 1:06 PM, solr user  wrote:
> > SOLR reports the term occurrence for terms over all the documents. I am
> > having trouble making a query that returns the term occurrence in a
> > specific page field called, documentPageId.
> >
> > I don't know how to issue a proper SOLR query that returns a word count
> for
> > a paragraph of text such as the term "amplifier" for a field. For some
> > reason it only returns.
> >
> > The things I've tried only return a count for 1 occurrence of the term
> even
> > though I see the term in the paragraph more than just once.
> >
> > I've tried faceting on the field, "contents"
> >
> >
> http://localhost:8983/solr/select?indent=on&q=*:*&wt=standard&facet=on&facet.field=documentPageId&facet.query=amplifier&facet.sort=lex&facet.missing=on&facet.method=count
> >
> > 
> > 
> > 21
> > 
> > 
> > 
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 1
> > 0
> > 
> > 
> > 
> > 
> > 
> > 
> >
> >
> > In schema.xml:
> >   > indexed="true" />
> >   > multiValued="false"/>
> >
> > In solrconfig.xml:
> >
> >   filewrapper
> >   caseNumber
> >   pageNumber
> >   documentId
> >   contents
> >   documentId
> >   caseNumber
> >   pageNumber
> >  documentPageId
> >   contents
> >
> > Thanks in advance,
>


Re: Failure noticed from new...@zju.edu.cn

2012-01-22 Thread Erick Erickson
I've seen the spam filter be pretty aggressive with HTML formatting etc,
what happens when you just send them as "plain text"?

Best
Erick

On Sat, Jan 21, 2012 at 7:24 AM, David Radunz  wrote:
> Hey,
>
>    Every time I send a reply to the list I get a failure for
> new...@zju.edu.cn. Should I just ignore this? I am unsure if the message has
> been delivered...
>
> Cheers,
>
> David


Re: Tika0.10 language identifier in Solr3.5.0

2012-01-22 Thread Erick Erickson
Would "doing the right thing" include firing the results at different
fields based on the language detected? Your answer to Jan
seems to indicate not, in which case my original comments
stand. The main point is that mixing all the *results* of the
analysis chains for multiple languages into a single field
will likely result in "interesting" behavior. Not to say it won't
be satisfactory in your situation, but there are edge cases.

Best
Erick

On Fri, Jan 20, 2012 at 9:15 AM, Ted Dunning  wrote:
> I think you misunderstood what I am suggesting.
>
> I am suggesting an analyzer that detects the language and then "does the
> right thing" according to the language it finds.   As such, it would
> tokenize and stem English according to English rules, German by German
> rules and would probably do a sliding bigram window in Japanese and Chinese.
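Ted's "does the right thing" analyzer can be illustrated roughly as follows. This is a toy Python sketch, not a real Lucene analyzer: the language detection is a crude stand-in (a production setup would use something like Tika's language identifier), and the per-language rules are deliberately simplistic.

```python
def detect_language(text):
    """Toy stand-in for a real language detector."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "en"

def analyze(text):
    """Route to a per-language analysis chain, as a polyglot analyzer would."""
    lang = detect_language(text)
    if lang == "zh":
        # CJK: sliding bigram window over the characters
        chars = [ch for ch in text if not ch.isspace()]
        return [a + b for a, b in zip(chars, chars[1:])]
    # English: lowercase whitespace tokens with a crude plural "stemmer"
    return [t.rstrip("s").lower() for t in text.split()]

print(analyze("Amplifiers and speakers"))  # → ['amplifier', 'and', 'speaker']
print(analyze("搜索引擎"))                  # → ['搜索', '索引', '引擎']
```

Erick's objection still applies: the English stems and the CJK bigrams end up as terms in the same field, so a query analyzed one way can accidentally hit tokens produced the other way.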
>
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson 
> wrote:
>
>> bq: Why not have a polyglot analyzer
>>
>> That could work, but it makes some compromises and assumes that your
>> languages are "close enough", I have absolutely no clue how that would
>> work for English and Chinese say.
>>
>> But it also introduces inconsistencies. Take stemming. Even though you
>> could easily stem in the correct language, throwing all those stems
into the same field can produce interesting results at search time since
>> you run the risk of hitting something produced by one of the other
>> analysis chains.
>>


Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

Hey,

I am trying to send this again as 'plain-text' to see if it 
delivers ok this time. All of the previous messages I sent should be below..


Cheers,

David

On 22/01/2012 11:42 PM, David Radunz wrote:

Hey James,

I have played around a bit more with the settings and tried 
setting spellcheck.maxResultsForSuggest=100 and 
spellcheck.maxCollations=3. This yields 'Sigourney Weaver' as ONE of 
the corrections, but it's the second one and not the first. Which is 
wrong if this is a patch for 'context sensitive', because it doesn't 
really seem to honor any context at all. Unless I am misunderstanding 
this? Also, I don't really like maxResultsForSuggest as it means 'all 
or nothing'. If you set it to 10 and there are 100 results, then you 
offer no corrections at all even if the term is missing from the 
dictionary entirely.


If I set spellcheck.maxResultsForSuggest=100 and 
spellcheck.maxCollations=3 and choose the collation with the largest 
'hits' I get Sigourney Weaver and other 'popular' terms. But say I 
searched for 'pork and chups', the 'popular' correction is 'park and 
chips' where as the first correction was correct: 'pork and chips'.


So really, none of the solutions either in this patch or Solr 
offer what I would truly call context-sensitive spell checking. That 
being, in a full text search engine you find documents based on terms 
and how close they are together in the document. It makes more than 
perfect sense to treat the dictionary like this, so that when there 
are multiple terms it offers suggestions for the terms that match 
closely to what's entered surrounding the term.


Example:

"Sigourney Wever" would never appear in a document ever.
"Sigourney Weaver" however has many 'hits' in exactly that order 
of words.


So there needs to be a way to boost suggestions based on adjacency...  
Much like the full text search operates.
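The adjacency-boosting idea can be sketched with a toy bigram index. The hit counts below are invented for illustration; a real implementation would derive them from phrase or shingle frequencies in the index.

```python
from itertools import product

# Toy bigram hit counts standing in for phrase frequencies in the index
# (the numbers are invented for illustration).
BIGRAM_HITS = {
    ("sigourney", "weaver"): 120,
    ("pork", "chips"): 3,
    ("park", "chips"): 40,
}

def best_correction(candidates_per_term):
    """Choose the candidate combination whose adjacent word pairs
    co-occur most often in the corpus, i.e. boost by adjacency."""
    best, best_score = None, -1
    for combo in product(*candidates_per_term):
        score = sum(BIGRAM_HITS.get(p, 0) for p in zip(combo, combo[1:]))
        if score > best_score:
            best, best_score = combo, score
    return list(best)

print(best_correction([["sigourney"], ["wever", "weaver", "waver"]]))
# → ['sigourney', 'weaver']
```

'weaver' wins here not because it is individually popular but because the adjacent pair ("sigourney", "weaver") actually occurs in the corpus, which is exactly the context signal described above.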


Thoughts?

David

On 22/01/2012 9:56 PM, David Radunz wrote:

James,

I worked out that I actually needed to 'apply' patch SOLR-2585, 
whoops. So I have done that now and it seems to return 
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't 
even in the dictionary). Could something have changed in the trunk to 
make your patch no longer work? I had to manually merge the setup for 
the test case due to a new 'hyphens' test case. The settings I am using 
are:



explicit
10

false
10
true
true
true
10
1

5
1




default
spell
solr.DirectSolrSpellChecker



internal


0.5


2

1

5

4


0.01



spellchecker
true


With the query:

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 



Cheers,

David


On 22/01/2012 2:03 AM, David Radunz wrote:

James,

Thanks again for your lengthy and informative response. I 
updated from SVN trunk again today and was successfully able to run 
'ant test'. So I proceeded with trying your suggestions (for 
question 1 so far):


On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in 
your index.  So even if "wever" is misspelled in context, if it 
exists in the index the spell checker will not try correcting it.  
There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x 
only).  See https://issues.apache.org/jira/browse/SOLR-2585
I have tried using this with the original test case of 'Sigorney 
Wever'. I didn't notice any difference, although I am a little 
unclear as to what exactly this patch does. Nor am I really clear 
what to set either of the options to, so I set them both to '5'. I 
tried to find the test case it mentions, but it's not present in 
SpellCheckCollatorTest.java .. Any suggestions?


2. try "onlyMorePopular=true" in your request.  
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  
But see the September 2, 2011 comment in SOLR-2585 about why this 
might not do what you'd hope it would.


Trying this did produce 'Sigourney Weaver' as you would hope, but I 
am a little afraid of the downside. I would much rather have a context-
sensitive spell check that involves the terms around the correction.


3. If you're building your index on a copyField, you can add a 
stopword filter that filters out all of the misspelt or rare words 
from the field that the dictionary is based.  This could be an 
arduous task, and it may or may not work well for your data.
I am currently using a copyField for all terms that are relevant, 
which is quite a lot and the dictionary would encompass a huge 
amount of data. Adding stopword filters would be out of the question 
as we presently have more than 30,000 products and this is for the 
initial launch, we intend to have many many more.


As for your second question, I ta

Re: Failure noticed from new...@zju.edu.cn

2012-01-22 Thread David Radunz

Hey,

That seems to have helped, I didn't get a failure notice re-sending 
the message. I'll have to keep that in mind.


Thanks very much,

David

On 23/01/2012 12:41 PM, Erick Erickson wrote:

I've seen the spam filter be pretty aggressive with HTML formatting etc,
what happens when you just send them as "plain text"?

Best
Erick

On Sat, Jan 21, 2012 at 7:24 AM, David Radunz  wrote:

Hey,

Every time I send a reply to the list I get a failure for
new...@zju.edu.cn. Should I just ignore this? I am unsure if the message has
been delivered...

Cheers,

David




Re: Improving Solr Spell Checker Results

2012-01-22 Thread Erick Erickson
I can't help with your *real* problem, but when looking at patches,
if the "resolution" field isn't set to something like "fixed" it means
that the patch has NOT  been applied to any code lines. There
also should be commit revisions specified in the comments.
If "Fix Versions" has values, that doesn't mean the patch has
been applied either, that's often just a statement of where
the patch *should* go.

And, between the time someone uploads a patch and it actually
gets *committed*, the underlying code line can, indeed,  change
and the patch doesn't apply cleanly. Since you've already had
to do this, could you upload your version that *does* apply
cleanly?

Best
Erick

On Sun, Jan 22, 2012 at 2:56 AM, David Radunz  wrote:
> James,
>
>    I worked out that I actually needed to 'apply' patch SOLR-2585, whoops.
> So I have done that now and it seems to return 'correctlySpelled=true' for
> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
> something have changed in the trunk to make your patch no longer work? I had
> to manually merge the setup for the test case due to a new 'hyphens' test
> case. The settings I am using are:
>
> 
> explicit
> 10
>
> false
> 10
> true
> true
> true
> 10
> 1
>
> 5
> 1
> 
>
>
> 
> default
> spell
> solr.DirectSolrSpellChecker
>
> 
> internal
> 
> 0.5
> 
> 2
> 
> 1
> 
> 5
> 
> 4
> 
> 0.01
> 
> 
>
> spellchecker
> true
> 
>
> With the query:
>
> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
>
> Cheers,
>
> David
>
>
>
> On 22/01/2012 2:03 AM, David Radunz wrote:
>>
>> James,
>>
>>    Thanks again for your lengthy and informative response. I updated from
>> SVN trunk again today and was successfully able to run 'ant test'. So I
>> proceeded with trying your suggestions (for question 1 so far):
>>
>> On 17/01/2012 5:32 AM, Dyer, James wrote:
>>>
>>> David,
>>>
>>> The spellchecker normally won't give suggestions for any term in your
>>> index.  So even if "wever" is misspelled in context, if it exists in the
>>> index the spell checker will not try correcting it.  There are 3
>>> workarounds:
>>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>>>  See https://issues.apache.org/jira/browse/SOLR-2585
>>
>> I have tried using this with the original test case of 'Signorney Wever'.
>> I didn't notice any difference, although I am a little unclear as to what
>> exactly this patch does. Nor am I really clear what to set either of the
>> options to, so I set them both to '5'. I tried to find the test case it
>> mentions, but it's not present in SpellCheckCollatorTest.java .. Any
>> suggestions?
>>
>>> 2. try "onlyMorePopular=true" in your request.
>>>  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
>>>  But see the September 2, 2011 comment in SOLR-2585 about why this might not
>>> do what you'd hope it would.
>>
>>
>> Trying this did produce 'Signourney Weaver' as you would hope, but I am a
>> little afraid of the downside. I would much more like a context sensative
>> spell check that involves the terms around the correction.
>>>
>>>
>>> 3. If you're building your index on a, you can add a
>>> stopword filter that filters out all of the misspelt or rare words from the
>>> field that the dictionary is based.  This could be an arduous task, and it
>>> may or may not work well for your data.
>>
>> I am currently using a copyField for all terms that are relevant, which is
>> quite a lot and the dictionary would encompass a huge amount of data. Adding
>> stopword filters would be out of the question as we presently have more than
>> 30,000 products and this is for the initial launch, we intend to have many
>> many more.
>>>
>>>
>>> As for your second question, I take it you're using (e)dismax with
>>> multiple fields in "qf", right?  The only way I know to handle this is to
>>> create a copyField that combines all of the fields you search across.  Use
>>> this combined field to base your dictionary.  Also, specifying
>>> "spellcheck.maxCollationTries" with a non-zero value will weed out the
>>> nonsense word combinations that are likely to occur when doing this,
>>> ensuring that any collations provided will indeed yield hits.  The downside
>>> to doing this, of course, is it will make your first problem more acute in
>>> that there will be even more terms in your index that the spellchecker will
>>> ignore entirely, even if they're misspelled in context.  Once again,
>>> SOLR-2585 is designed to tackle this problem but it is still in its early
>>> stages, and thus far it is Trunk-only.
>>
>> I tried setting spellcheck.maxCollationTries to 5 to see if it would help
>> with the above problem, but it did not.
>>
>> I have now tried using it in the context of question 2. I 

Re: Improving Solr Spell Checker Results

2012-01-22 Thread David Radunz

Hey Erick,

Sure, can you explain the process to create the patch and upload it 
and i'll do it first thing tomorrow.


Thanks again for your help,

David

On 23/01/2012 12:51 PM, Erick Erickson wrote:

I can't help with your *real* problem, but when looking at patches,
if the "resolution" field isn't set to something like "fixed" it means
that the patch has NOT  been applied to any code lines. There
also should be commit revisions specified in the comments.
If "Fix Versions" has values, that doesn't mean the patch has
been applied either, that's often just a statement of where
the patch *should* go.

And, between the time someone uploads a patch and it actually
gets *committed*, the underlying code line can, indeed,  change
and the patch doesn't apply cleanly. Since you've already had
to do this, could you upload your version that *does* apply
cleanly?

Best
Erick

On Sun, Jan 22, 2012 at 2:56 AM, David Radunz  wrote:

James,

I worked out that I actually needed to 'apply' patch SOLR-2585, whoops.
So I have done that now and it seems to return 'correctlySpelled=true' for
'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
something have changed in the trunk to make your patch no longer work? I had
to manually merge the setup for the test case due to a new 'hyphens' test
case. The settings I am using are:


explicit
10

false
10
true
true
true
10
1

5
1




default
spell
solr.DirectSolrSpellChecker


internal

0.5

2

1

5

4

0.01



spellchecker
true


With the query:

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5

Cheers,

David



On 22/01/2012 2:03 AM, David Radunz wrote:

James,

Thanks again for your lengthy and informative response. I updated from
SVN trunk again today and was successfully able to run 'ant test'. So I
proceeded with trying your suggestions (for question 1 so far):

On 17/01/2012 5:32 AM, Dyer, James wrote:

David,

The spellchecker normally won't give suggestions for any term in your
index.  So even if "wever" is misspelled in context, if it exists in the
index the spell checker will not try correcting it.  There are 3
workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
  See https://issues.apache.org/jira/browse/SOLR-2585

I have tried using this with the original test case of 'Sigorney Wever'.
I didn't notice any difference, although I am a little unclear as to what
exactly this patch does. Nor am I really clear what to set either of the
options to, so I set them both to '5'. I tried to find the test case it
mentions, but it's not present in SpellCheckCollatorTest.java .. Any
suggestions?


2. try "onlyMorePopular=true" in your request.
  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
  But see the September 2, 2011 comment in SOLR-2585 about why this might not
do what you'd hope it would.


Trying this did produce 'Sigourney Weaver' as you would hope, but I am a
little afraid of the downside. I would much rather have a context-sensitive
spell check that involves the terms around the correction.


3. If you're building your index on a copyField, you can add a
stopword filter that filters out all of the misspelt or rare words from the
field that the dictionary is based.  This could be an arduous task, and it
may or may not work well for your data.

I am currently using a copyField for all terms that are relevant, which is
quite a lot and the dictionary would encompass a huge amount of data. Adding
stopword filters would be out of the question as we presently have more than
30,000 products and this is for the initial launch, we intend to have many
many more.


As for your second question, I take it you're using (e)dismax with
multiple fields in "qf", right?  The only way I know to handle this is to
create a copyField that combines all of the fields you search across.  Use
this combined field to base your dictionary.  Also, specifying
"spellcheck.maxCollationTries" with a non-zero value will weed out the
nonsense word combinations that are likely to occur when doing this,
ensuring that any collations provided will indeed yield hits.  The downside
to doing this, of course, is it will make your first problem more acute in
that there will be even more terms in your index that the spellchecker will
ignore entirely, even if they're misspelled in context.  Once again,
SOLR-2585 is designed to tackle this problem but it is still in its early
stages, and thus far it is Trunk-only.
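The collation-verification behavior James describes (spellcheck.maxCollationTries re-running candidate collations as real queries and keeping only those that yield hits) can be sketched like this. The index and candidate lists below are invented, and the "query" is a naive set-membership check standing in for a real Solr search:

```python
def hits(query_terms, index):
    """Stand-in for re-running a candidate collation as a real query."""
    return sum(1 for doc in index if all(t in doc for t in query_terms))

def verified_collations(candidates, index, max_tries):
    """Mimic spellcheck.maxCollationTries: test up to max_tries candidate
    collations against the index and keep only those that return hits."""
    return [c for c in candidates[:max_tries] if hits(c, index) > 0]

# Each "document" is just a set of its terms in this toy model.
index = [{"sigourney", "weaver", "alien"}, {"park", "bench"}]
candidates = [["sigourney", "waver"], ["sigourney", "weaver"]]
print(verified_collations(candidates, index, max_tries=5))
# → [['sigourney', 'weaver']]
```

This is why a non-zero maxCollationTries weeds out nonsense word combinations: a collation only survives if the whole phrase actually matches documents.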

I tried setting spellcheck.maxCollationTries to 5 to see if it would help
with the above problem, but it did not.

I have now tried using it in the context of question 2. I tried searching
for 'Sigorney Wever' in the series name (which it's not present in, as its
an actor):


sp

Re: Improving Solr Spell Checker Results

2012-01-22 Thread Erick Erickson
David:

There's some good info here:
http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

But the short form is to go into solr_home and issue this command:
'svn diff > SOLR-2585.patch'. IDE's may also have a "create patch"
feature, but I find the straight SVN command more reliable.

Note I'm not saying that your patch will necessarily be picked up, but
it's a thoughtful gesture to upload a more current patch. In your
comments please identify what code line you're working on (4.x? 3.x?).

And when you upload, down near the bottom of the dialog box there'll be
a radio button about "grant ASF license" which is fairly important to
click for legal reasons

Thanks
Erick

On Sun, Jan 22, 2012 at 5:54 PM, David Radunz  wrote:
> Hey Erick,
>
>    Sure, can you explain the process to create the patch and upload it and
> i'll do it first thing tomorrow.
>
> Thanks again for your help,
>
> David
>
>
> On 23/01/2012 12:51 PM, Erick Erickson wrote:
>>
>> I can't help with your *real* problem, but when looking at patches,
>> if the "resolution" field isn't set to something like "fixed" it means
>> that the patch has NOT  been applied to any code lines. There
>> also should be commit revisions specified in the comments.
>> If "Fix Versions" has values, that doesn't mean the patch has
>> been applied either, that's often just a statement of where
>> the patch *should* go.
>>
>> And, between the time someone uploads a patch and it actually
>> gets *committed*, the underlying code line can, indeed,  change
>> and the patch doesn't apply cleanly. Since you've already had
>> to do this, could you upload your version that *does* apply
>> cleanly?
>>
>> Best
>> Erick
>>
>> On Sun, Jan 22, 2012 at 2:56 AM, David Radunz  wrote:
>>>
>>> James,
>>>
>>>    I worked out that I actually needed to 'apply' patch SOLR-2585,
>>> whoops.
>>> So I have done that now and it seems to return 'correctlySpelled=true'
>>> for
>>> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
>>> something have changed in the trunk to make your patch no longer work? I
>>> had
>>> to manually merge the setup for the test case due to a new 'hyphens' test
>>> case. The settings I am using are:
>>>
>>> 
>>> explicit
>>> 10
>>>
>>> false
>>> 10
>>> true
>>> true
>>> true
>>> 10
>>> 1
>>>
>>> 5
>>> 1
>>> 
>>>
>>>
>>> 
>>> default
>>> spell
>>> solr.DirectSolrSpellChecker
>>>
>>> 
>>> internal
>>> 
>>> 0.5
>>> 
>>> 2
>>> 
>>> 1
>>> 
>>> 5
>>> 
>>> 4
>>> 
>>> 0.01
>>> 
>>> 
>>>
>>> spellchecker
>>> true
>>> 
>>>
>>> With the query:
>>>
>>>
>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
>>>
>>> Cheers,
>>>
>>> David
>>>
>>>
>>>
>>> On 22/01/2012 2:03 AM, David Radunz wrote:

 James,

    Thanks again for your lengthy and informative response. I updated
 from
 SVN trunk again today and was successfully able to run 'ant test'. So I
 proceeded with trying your suggestions (for question 1 so far):

 On 17/01/2012 5:32 AM, Dyer, James wrote:
>
> David,
>
> The spellchecker normally won't give suggestions for any term in your
> index.  So even if "wever" is misspelled in context, if it exists in
> the
> index the spell checker will not try correcting it.  There are 3
> workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>  See https://issues.apache.org/jira/browse/SOLR-2585

 I have tried using this with the original test case of 'Signorney
 Wever'.
 I didn't notice any difference, although I am a little unclear as to
 what
 exactly this patch does. Nor am I really clear what to set either of the
 options to, so I set them both to '5'. I tried to find the test case it
 mentions, but it's not present in SpellCheckCollatorTest.java .. Any
 suggestions?

> 2. try "onlyMorePopular=true" in your request.
>
>  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
>  But see the September 2, 2011 comment in SOLR-2585 about why this
> might not
> do what you'd hope it would.


 Trying this did produce 'Signourney Weaver' as you would hope, but I am
 a
 little afraid of the downside. I would much more like a context
 sensative
 spell check that involves the terms around the correction.
>
>
> 3. If you're building your index on a, you can add a
> stopword filter that filters out all of the misspelt or rare words from
> the
> field that the dictionary is based.  This could be an arduous task, and
> it
> may or may not work well for your data.

 I am currently using a copyField for all terms that are relevant, which
 is
>>

Re: Phonetic search for portuguese

2012-01-22 Thread Gora Mohanty
On Mon, Jan 23, 2012 at 5:58 AM, Anderson vasconcelos
 wrote:
> Hi Gora, thanks for the reply.
>
> I'm interested in seeing how you did this solution. But my time is not
> too long, and I need to create some solution for my client early. If
> anyone knows some other simple and fast solution, please post on this
> thread.

What is your time line? I will see if we can expedite the open
sourcing of this.

> Gora, could you talk about how you implemented the custom FilterFactory and
> how you used it in Solr?
[...]

That part is quite simple, though it is possible that I have not
correctly addressed all issues for a custom FilterFactory.
Please see:
  AspellFilterFactory: http://pastebin.com/jTBcfmd1
  AspellFilter:http://pastebin.com/jDDKrPiK

The latter loads a java_aspell library that is created with SWIG,
by setting up Java bindings on top of aspell and configuring
it for the language of interest.

Next, you will need a library that encapsulates various
aspell functionality in Java. I am afraid that this is a little
long:
  Suggest: http://pastebin.com/6NrGCVma

Finally, you will have to set up the Solr schema to use
this filter factory, e.g., one could create a new Solr
TextField, where the solr.DoubleMetaphoneFilterFactory
is replaced with
com.mimirtech.search.solr.analysis.AspellFilterFactory
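Conceptually, a phonetic filter of this kind replaces each token with a phonetic key so that spelling variants index to the same term. A toy Python sketch follows; the key function below is a deliberately crude normalization (drop vowels after the first letter, collapse repeated consonants), NOT real aspell or metaphone output:

```python
def phonetic_key(token):
    """Toy phonetic normalization (not real aspell/metaphone): keep the
    first letter, drop later vowels, and collapse repeated characters."""
    token = token.lower()
    key = token[0]
    for ch in token[1:]:
        if ch in "aeiou":
            continue
        if key[-1] != ch:
            key += ch
    return key

def phonetic_filter(tokens):
    """What a phonetic TokenFilter does: emit one key per input token,
    so spelling variants collide on the same indexed term."""
    return [phonetic_key(t) for t in tokens]

print(phonetic_filter(["weaver", "wever"]))  # → ['wvr', 'wvr']
```

A language-specific backend like aspell should produce far better keys for Portuguese than either this toy rule or DoubleMetaphone, which is tuned for English.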

We can discuss further how to set this up, but should
probably take that discussion off-list.

Regards,
Gora


Re: Phonetic search for portuguese

2012-01-22 Thread Anderson vasconcelos
Thanks a lot Gora.
I need to deliver the first release for my client on 25 January.
With your explanation, I can better negotiate the date to deliver
this feature next month, because I have other business rules to
deliver and this feature is more complex than I thought.
I could help you to share this solution with the Solr community. Maybe we
can create some component on Google Code, or something like that, which
any Solr user can use.

2012/1/23, Gora Mohanty :
> On Mon, Jan 23, 2012 at 5:58 AM, Anderson vasconcelos
>  wrote:
>> Hi Gora, thanks for the reply.
>>
>> I'm interesting in see how you did this solution. But , my time is not
>> to long and i need to create some solution for my client early. If
>> anyone knows some other simple and fast solution, please post on this
>> thread.
>
> What is your time line? I will see if we can expedite the open
> sourcing of this.
>
>> Gora, you could talk how you implemented the Custom Filter Factory and
>> how used this on SOLR?
> [...]
>
> That part is quite simple, though it is possible that I have not
> correctly addressed all issues for a custom FilterFactory.
> Please see:
>   AspellFilterFactory: http://pastebin.com/jTBcfmd1
>   AspellFilter:http://pastebin.com/jDDKrPiK
>
> The latter loads a java_aspell library that is created by SWIG
> by setting up Java bindings on top of SWIG, and configuring
> it for the language of interest.
>
> Next, you will need a library that encapsulates various
> aspell functionality in Java. I am afraid that this is a little
> long:
>   Suggest: http://pastebin.com/6NrGCVma
>
> Finally, you will have to set up the Solr schema to use
> this filter factory, e.g., one could create a new Solr
> TextField, where the solr.DoubleMetaphoneFilterFactory
> is replaced with
> com.mimirtech.search.solr.analysis.AspellFilterFactory
>
> We can discuss further how to set this up, but should
> probably take that discussion off-list.
>
> Regards,
> Gora
>


Re: Phonetic search for portuguese

2012-01-22 Thread Gora Mohanty
On Mon, Jan 23, 2012 at 9:21 AM, Anderson vasconcelos
 wrote:
> Thanks a lot Gora.
> I need to delivery the first release for my client on 25 january.
> With your explanation, i can negociate better the date to delivery of
> this feature for next month, because i have other business rules for
> delivery and this features is more complex than i thought.

OK. I have ideas on how to improve this solution, but
we can take these up at a later stage. We have tested
this solution, and I know that it works. I will also be
discussing with people here about how soon we can
open source this.

> I could help you to shared this solution with solr community. Maybe we
> can create some component in google code, or something like that, wich
> any solr user can use.

Yes, I have been meaning to do that forever, but work has
been intruding. We will put up something on BitBucket as
soon as possible.

Regards,
Gora


Solr Cores

2012-01-22 Thread Sujatha Arun
Hello,

We have in production a number of individual Solr instances on a single
JVM. As a result, we see that the PermGen space keeps increasing with each
additional instance added.

I would like to know if we can have Solr cores instead of individual
instances.


   - Is there any limit to the number of cores for a single instance?
   - Will this decrease the PermGen space, as the lib is shared?
   - Would there be any decrease in performance with the number of cores added?
   - Anything else that I should know before moving to cores?


Any help would be appreciated.

Regards
Sujatha


Re: Search within words

2012-01-22 Thread jawedshamshedi
Hi
Thanks for the reply..
I am using NGramFilterFactory for this, but it's not working as desired.
For example, I have a field article_type that has been indexed using the
field type below.


 
   
   
   
 
 
   
   
 


The field definition for indexing is :

 
Now the problem is that the article_type field has values like
'earring' and 'ring', and it's required that when we search for 'ring',
'earring' should also come up. But it's not happening.

What else needs to be done in order to achieve this?
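One common cause is that the configured gram sizes don't cover the query term (the old defaults were minGramSize=1, maxGramSize=2), or that the same n-gram filter is also applied at query time. At index time, NGramFilterFactory emits every substring between minGramSize and maxGramSize, so 'earring' only yields the gram 'ring' if maxGramSize is at least 4. A toy sketch of the index-side expansion (gram sizes assumed for illustration):

```python
def ngrams(token, min_n, max_n):
    """Generate all substrings of length min_n..max_n of a token,
    roughly what NGramFilterFactory emits at index time."""
    return {token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)}

# Index side: n-gram the stored value. Query side: leave the term whole.
indexed = ngrams("earring", min_n=2, max_n=5)
print("ring" in indexed)  # → True, only because max_n >= len("ring")
```

So check that maxGramSize is at least the length of the shortest term you expect to match inside longer words, and that the query-side analyzer does not re-expand the query into grams.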

Any further help will be appreciated.  

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-within-words-tp3675210p3681044.html
Sent from the Solr - User mailing list archive at Nabble.com.