Re: Boosting based on field values

2010-07-04 Thread Indika Tantrigoda
Hi,

{!boost b=pow(1,featured_listing)} is the boost function I used.

Got the results as expected.

Thanks.
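
For reference, the complete request looks something like the following (host,
port and the query itself are illustrative; and since pow(1,x) is always 1, a
larger base such as pow(10,featured_listing) is what makes the boost actually
grow with the field value):

http://localhost:8983/solr/select?q={!boost b=pow(10,featured_listing)}your+query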

Regards,
Indika


On 3 July 2010 21:10, Indika Tantrigoda  wrote:

> Thanks for the info. I'll try this out.
>
> Regards,
> Indika
>
>
> On 3 July 2010 20:48, Ahmet Arslan  wrote:
>
>> > I'd like to know if it's possible to boost the score of
>> > documents based on a
>> > field value. Ex. My schema has a field called isFeatured, and if the
>> > value of the field
>> > is true or "1" I'd like to
>> > have these documents come first in a query result
>> > regardless of the score.
>>
>> Yes it is possible. http://wiki.apache.org/solr/FunctionQuery
>>
>> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
>>
>> Something like:
>> &q={!boost b=pow(x,abs(sub(isFeatured,1)),0.5)}yourQuery
>>
>>
>>
>>
>>
>


Re: Dilemma - Very Frequent Synonym updates for Huge Index

2010-07-04 Thread Erick Erickson
About reindexing and performance. This is not really a problem as you
can re-index on a completely different machine and then just
move the completed index to your production machines and reopen
your index. SOLR has this capability out of the box. Here's a link
to get you started:
http://wiki.apache.org/solr/SolrCollectionDistributionScripts

Your first few queries on a newly-opened index will be a bit slower
unless you do pre-warming. But the reindexing process can be
done without affecting the current searcher in any way. Of course
you'll need the disk space available, but disks are cheap ...

HTH
Erick
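
As a minimal sketch of the move itself (host and paths are illustrative; the
collection distribution scripts linked above automate a more robust version
of this):

rsync -av --delete indexer:/opt/solr/data/index/ /opt/solr/data/index/
curl http://localhost:8983/solr/update -d '<commit/>'

The empty commit makes Solr open a new searcher on the freshly copied files.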

On Thu, Jul 1, 2010 at 2:06 PM, Ravi Kiran  wrote:

> Hello Mr. Høydahl,
>  I thought of doing it exactly as you have said;
> shall try it out and see where I land. However I am still skeptical about
> that approach from the performance point of view, as we are a round-the-clock
> news organization, and huge reindexing might affect the speed of searches;
> moreover, in the news business "being first" is more important, hence we
> need those synonyms to take effect right away, and that's where we are in a
> quandary.
>
>   With regards to the OpenNLP implementation, our design is plain vanilla
> outside of SOLR. We generate the XML on the fly with extracted entities
> from
> OpenNLP and then index it straight into SOLR. However, we do some sanity
> checks for locations prior to indexing using WordNet so that false
> positives
> are avoided in location names.
>
> Thanks,
>
> Ravi Kiran Bhaskar
>
> On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
> jan@cominvent.com> wrote:
>
> > Hi,
> >
> > I think I would look at a hybrid approach, where you keep adding new
> > synonyms to a query-side synonym dictionary for immediate effect. And then
> > every now and then, or every Nth night, you move those synonyms over to the
> > index-side dictionary and trigger a full reindex.
> >
> > A nice side effect of reindexing now and then could be that if your OpenNLP
> > extraction dictionaries have changed, it will be reflected too.
> >
> > BTW: Could you share details of your OpenNLP integration with us? I'm
> about
> > to do it on another project..
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > Training in Europe - www.solrtraining.com
> >
> > On 1. juli 2010, at 06.57, Ravi Kiran wrote:
> >
> > > Hello,
> > >    Hoping some Solr guru can help me out here. We are a news
> > > organization trying to migrate 10 million documents from FAST to Solr.
> > > The plan is to have our Editorial team add/modify synonyms multiple times
> > > during a day as they deem appropriate. Hence we plan on using query-time
> > > synonyms, as we cannot reindex every time they modify the synonyms file
> > > (for the entities extracted by OpenNLP, like location/organization/person
> > > names from the article body). Since the synonyms are for names, I am
> > > concerned that the multi-phrase issue crops up with query-time synonyms.
> > > For example, synonyms could be as follows:
> > >
> > > The Washington Post Co., The Washington Post, Washington Post, The
> Post,
> > > TWP, WAPO
> > > DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> > > USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> > >
> > > Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> > > Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary
> > > Clinton,Sen. Clinton
> > > William J. Clinton,William Jefferson Clinton,President
> Clinton,President
> > > Bill Clinton
> > >
> > > Virginia, Va., VA
> > > D.C,Washington D.C, District of Columbia
> > >
> > > I have the following fieldType in schema.xml for the keywords/entities...
> > > What issues should I be aware of? And is there a better way to achieve
> > > this without having to reindex a million docs on each synonym change?
> > > NOTE that I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> > > SynonymFilterFactory to keep the words intact without splitting.
> > >
> > > <fieldType name="..." class="solr.TextField" sortMissingLast="true"
> > >            omitNorms="true" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory"
> > >             words="stopwords.txt,entity-stopwords.txt"
> > >             enablePositionIncrements="true"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory"
> > >             words="stopwords.txt,entity-stopwords.txt"
> > >             enablePositionIncrements="true"/>
> > >     <filter class="solr.SynonymFilterFactory"
> > >             tokenizerFactory="solr.KeywordTokenizerFactory"
> > >             synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> > >             ignoreCase="true" expand="true"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> >
> >
>


FastVectorHighlighter and SynonymFilter

2010-07-04 Thread ito hayato
Hi all,

I tried using SynonymFilter and
FastVectorHighlighter together,
and got an empty highlight response.
To investigate the cause, I tested every combination of
hl.useFastVectorHighlighter and the expand attribute.

---
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"
        tokenizerFactory="solr.CJKTokenizerFactory"/>
---

Result:

 expand="true" in solrconfig.xml,and
hl.useFastVectorHighlighter=true
 -> Returned highlight is empty as below.

---
<lst name="highlighting">
  <lst name="..."/>
</lst>
---

 expand="false" , and hl.useFastVectorHighlighter=true
 expand="true"  , and hl.useFastVectorHighlighter=false
 expand="false" , and hl.useFastVectorHighlighter=false

 -> In these cases, highlighting returns the correct snippets.


Is using SynonymFilter and FastVectorHighlighter
together not supported?
Are these components incompatible?

Additional notes:
 - the target field type is defined as follows
 - the target field is tokenized by CJKTokenizer
 - the problem occurs only when I search with a Japanese keyword,
   but not with an English keyword
   (is the trouble caused by the n-gram tokenization?)

---
<fieldType name="..." class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"
            tokenizerFactory="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
---

--
2010 FIFA World Cup News [Yahoo!Sports/sportsnavi]
http://pr.mail.yahoo.co.jp/southafrica2010/


Duplicate items in distributed search

2010-07-04 Thread Andrew Clegg

Hi,

I'm after a bit of clarification about the 'limitations' section of the
distributed search page on the wiki.

The first two limitations say:

* Documents must have a unique key and the unique key must be stored
(stored="true" in schema.xml)

* When duplicate doc IDs are received, Solr chooses the first doc and
discards subsequent ones

Does 'doc ID' in the second point refer to the unique key in the first
point, or does it refer to the internal Lucene document ID?

Cheers,

Andrew.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-items-in-distributed-search-tp942408p942408.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Duplicate items in distributed search

2010-07-04 Thread Mark Miller
On 7/4/10 11:41 AM, Andrew Clegg wrote:
> 
> Hi,
> 
> I'm after a bit of clarification about the 'limitations' section of the
> distributed search page on the wiki.
> 
> The first two limitations say:
> 
> * Documents must have a unique key and the unique key must be stored
> (stored="true" in schema.xml)
> 
> * When duplicate doc IDs are received, Solr chooses the first doc and
> discards subsequent ones
> 
> Does 'doc ID' in the second point refer to the unique key in the first
> point, or does it refer to the internal Lucene document ID?
> 
> Cheers,
> 
> Andrew.
> 

The 'doc ID' in the second point refers to the unique key in the first
point.

-- 
- Mark

http://www.lucidimagination.com


Re: Duplicate items in distributed search

2010-07-04 Thread Andrew Clegg


Mark Miller-3 wrote:
> 
> The 'doc ID' in the second point refers to the unique key in the first
> point.
> 

I thought so but thanks for clarifying. Maybe a wording change on the wiki
would be good?

Cheers,

Andrew.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-items-in-distributed-search-tp942408p942554.html
Sent from the Solr - User mailing list archive at Nabble.com.


Using symlinks to alias cores

2010-07-04 Thread Andrew Clegg

Another question...

I have a series of cores representing historical data, only the most recent
of which gets indexed to.

I'd like to alias the most recent one to 'current' so that when they roll
over I can just change the alias, and the cron jobs etc. which manage
indexing don't have to change.

However, the wiki recommends against using the ALIAS command in CoreAdmin in
a couple of places, and SOLR-1637 says it's been removed now anyway.

If I can't use ALIAS safely, is it okay to just symlink the most recent
core's instance (or data) directory to 'current', and bring it up in Solr as
a separate core? Will this be safe, as long as all index writing happens via
the 'current' core?
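
For concreteness, the setup I have in mind is something like this (paths are
illustrative):

ln -s /opt/solr/cores/core_201007 /opt/solr/cores/current

with a matching entry in solr.xml:

<core name="current" instanceDir="current"/>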

Or will it cause Solr to get confused and do horrible things to the index?

Thanks!

Andrew.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-symlinks-to-alias-cores-tp942567p942567.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dilemma - Very Frequent Synonym updates for Huge Index

2010-07-04 Thread Ravi Kiran
Hello Mr. Høydahl,
  Yes, you are right: we can selectively reindex,
which would reduce the amount of indexing, but not by much for commonly
occurring entities. For example, George W. Bush / Barack Obama / Afghanistan
/ Iraq etc. occur in most of the documents from the last 5 years, so a couple
of million docs would be reindexed every time. BTW, my boss has mentioned I
won't be getting any new server due to budget constraints, so I am stuck with
a single machine to do both reindexing and searches.

With query-side-only synonyms (no index-time synonyms, as facets don't honor
synonyms) the issue would be that all variations of a name get displayed,
since I use the field as a multiValued facet field and display it (our
requirements want only one variation shown, as it will be easy to use an
alphabetical listing like A, B, C...Z).

I know it is not the right kind of design, considering millions of entities
should not be made facets, but my business requirements also state that an
entity is eligible for display only if it has more than 5 occurrences, and
hence I can use facet.keyword.mincount=5 configured in my solrconfig.xml,
which is quite easy. That's my motivation for using facets.

Ideally, for my SynonymFilter I want expand="false" at index time (to make
sure only one variant shows in the display) and expand="true" at query time
(so that a newly added synonym works instantly after a core reload). But an
inner-class method, MultiPhraseWeight.scorer in MultiPhraseQuery, throws
errors, probably because multi-word synonyms are not supported at query
time. I do not know why Solr chose to use WhitespaceTokenizer even when the
tokenizer for a field is explicitly defined in the schema.xml (in my case
KeywordTokenizer).
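
To make the intent concrete, here is a sketch of that setup, with
illustrative field and file names:

<fieldType name="entity" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="entity-synonyms.txt"
            ignoreCase="true" expand="false"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="entity-synonyms.txt"
            ignoreCase="true" expand="true"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>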

Thanks for your continued interest in answering my questions.

Ravi Kiran Bhaskar


On Thu, Jul 1, 2010 at 7:08 PM, Jan Høydahl / Cominvent <
jan@cominvent.com> wrote:

> Hi,
>
> Another more complex approach is to design a routine that once in a while
> selectively decides what documents to reindex based on a query on the newly
> added synonym entries, and refeeds those with the new index-side dictionary
> in place. Could work well.
>
> I would consider an architecture where your indexers only do indexing
> (except at disaster, where they can do search as well) - in that case you can
> happily reindex without worrying about affecting user experience.
>
> What exactly is the issue you see with the query-side-only synonym
> expansion when using KeywordTokenizer?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 1. juli 2010, at 20.06, Ravi Kiran wrote:
>
> > Hello Mr. Høydahl,
> >  I thought of doing it exactly as you have said;
> > shall try it out and see where I land. However I am still skeptical about
> > that approach from the performance point of view, as we are a round-the-clock
> > news organization, and huge reindexing might affect the speed of searches;
> > moreover, in the news business "being first" is more important, hence we
> > need those synonyms to take effect right away, and that's where we are in a
> > quandary.
> >
> >   With regards to the OpenNLP implementation, our design is plain vanilla
> > outside of SOLR. We generate the XML on the fly with extracted entities
> from
> > OpenNLP and then index it straight into SOLR. However, we do some sanity
> > checks for locations prior to indexing using WordNet so that false
> positives
> > are avoided in location names.
> >
> > Thanks,
> >
> > Ravi Kiran Bhaskar
> >
> > On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
> > jan@cominvent.com> wrote:
> >
> >> Hi,
> >>
> >> I think I would look at a hybrid approach, where you keep adding new
> >> synonyms to a query-side synonym dictionary for immediate effect. And then
> >> every now and then, or every Nth night, you move those synonyms over to
> >> the index-side dictionary and trigger a full reindex.
> >>
> >> A nice side effect of reindexing now and then could be that if your
> OpenNLP
> >> extraction dictionaries have changed, it will be reflected too.
> >>
> >> BTW: Could you share details of your OpenNLP integration with us? I'm
> about
> >> to do it on another project..
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >> Training in Europe - www.solrtraining.com
> >>
> >> On 1. juli 2010, at 06.57, Ravi Kiran wrote:
> >>
> >>> Hello,
> >>>   Hoping some Solr guru can help me out here. We are a news
> >>> organization trying to migrate 10 million documents from FAST to Solr.
> >>> The plan is to have our Editorial team add/modify synonyms multiple
> >>> times during a day as they deem appropriate. Hence we plan on using
> >>> query-time synonyms, as we cannot reindex every time they modify the
> >>> synonyms file (for the entities extracted by OpenNLP, like
> >>> locations/organizations/person 

Error in building Solr-Cloud (ant example)

2010-07-04 Thread jayf

Hi there,

I'm having trouble installing Solr Cloud. I checked out the project, but
when compiling ("ant example" on OS X) I get a compile error ("cannot find
symbol" - pasted below). 

I also get a bunch of warnings:
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
I have tried both Java 1.5 and 1.6. 


Before I got to this point, I was having problems with the included
ZooKeeper jar (a Java versioning issue) - so I had to download the source and
build it myself. Now 'ant' gets a bit further, to the stage listed above. 

Any idea of the problem??? THANKS!

[javac] Compiling 438 source files to
/Volumes/newpart/solrcloud/cloud/build/solr
[javac]
/Volumes/newpart/solrcloud/cloud/src/java/org/apache/solr/cloud/ZkController.java:588:
cannot find symbol
[javac] symbol  : method stringPropertyNames()
[javac] location: class java.util.Properties
[javac] for (String sprop :
System.getProperties().stringPropertyNames()) {
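
A note on the symbol itself: java.util.Properties.stringPropertyNames() was
only added in Java 6, so a compile against 1.5 class libraries fails with
exactly this "cannot find symbol" -- perhaps ant is still picking up a 1.5
JDK somewhere despite my switching. A minimal reproduction:

// compiles under Java 6+; javac against Java 1.5 class libraries reports
// "cannot find symbol" for stringPropertyNames()
for (String sprop : System.getProperties().stringPropertyNames()) {
    System.out.println(sprop);
}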

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-in-building-Solr-Cloud-ant-example-tp942836p942836.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error in building Solr-Cloud (ant example)

2010-07-04 Thread Mark Miller
Hey jayf -

Offhand I'm not sure why you are having these issues - last I knew, a
couple people had had success with the cloud branch. Cloud has moved on
from that branch really though - we probably should update the wiki
about that. More important than that, though, is that I need to get Cloud
committed to trunk!

I've been saying it for a while, but I'm going to make a strong effort
to wrap up the final unit test issue (apparently a testing issue, not a
cloud issue) and get this committed for further iterations.

The way to follow along with the latest work is to go to:
https://issues.apache.org/jira/browse/SOLR-1873

The latest patch there should apply to recent trunk.

I've scheduled a bit of time to work on getting this committed this
week, fingers crossed.

-- 
- Mark

http://www.lucidimagination.com

On 7/4/10 3:37 PM, jayf wrote:
> 
> Hi there,
> 
> I'm having trouble installing Solr Cloud. I checked out the project, but
> when compiling ("ant example" on OS X) I get a compile error ("cannot find
> symbol" - pasted below). 
> 
> I also get a bunch of warnings:
> [javac] Note: Some input files use or override a deprecated API.
> [javac] Note: Recompile with -Xlint:deprecation for details.
> I have tried both Java 1.5 and 1.6. 
> 
> 
> Before I got to this point, I was having problems with the included
> ZooKeeper jar (a Java versioning issue) - so I had to download the source and
> build it myself. Now 'ant' gets a bit further, to the stage listed above. 
> 
> Any idea of the problem??? THANKS!
> 
> [javac] Compiling 438 source files to
> /Volumes/newpart/solrcloud/cloud/build/solr
> [javac]
> /Volumes/newpart/solrcloud/cloud/src/java/org/apache/solr/cloud/ZkController.java:588:
> cannot find symbol
> [javac] symbol  : method stringPropertyNames()
> [javac] location: class java.util.Properties
> [javac] for (String sprop :
> System.getProperties().stringPropertyNames()) {
> 




Re: Duplicate items in distributed search

2010-07-04 Thread Mark Miller
On 7/4/10 12:49 PM, Andrew Clegg wrote:
> 
> 
> Mark Miller-3 wrote:
>>
>> The 'doc ID' in the second point refers to the unique key in the first
>> point.
>>
> 
> I thought so but thanks for clarifying. Maybe a wording change on the wiki
> would be good?
> 
> Cheers,
> 
> Andrew.
> 

Sounds like a good idea - go ahead and make the change if you'd like.

-- 
- Mark

http://www.lucidimagination.com


Re: Duplicate items in distributed search

2010-07-04 Thread Andrew Clegg


Mark Miller-3 wrote:
> 
> On 7/4/10 12:49 PM, Andrew Clegg wrote:
>> I thought so but thanks for clarifying. Maybe a wording change on the
>> wiki
> 
> Sounds like a good idea - go ahead and make the change if you'd like.
> 

That page seems to be marked immutable...
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-items-in-distributed-search-tp942408p942984.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Field Collapse question

2010-07-04 Thread Martijn v Groningen
Hi Ken,

Leaving documents with a null collapse-field value uncollapsed is not
possible in the patch as-is. However, if you want, you can fix this in the
patch yourself; it is a really small change. Assuming that you're using the
default collapsing algorithm, you can add the following piece of code in the
NonAdjacentDocumentCollapser.java file:
if (currentValue == null) {
   continue;
}

Place it in the doCollapsing method after the following statement:
String currentValue = values.lookup[values.order[currentId]];

This makes sure that documents that have no value in the collapse
field are not collapsed.
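
Put together, the relevant lines in doCollapsing would then look roughly like
this (surrounding code elided; based on the description above rather than on
the full patch source):

String currentValue = values.lookup[values.order[currentId]];
if (currentValue == null) {
  continue; // a document without a value in the collapse field is kept as-is
}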

Field collapsing has a big impact on your search times, and also, to a
lesser extent, on memory usage. It can increase search times up to 10
times, but this depends on the situation. As indexes get bigger this
becomes a bigger problem. Also, using field collapsing in a distributed
environment can cause problems. This is because collapse information
is not shared between shards, resulting in incorrect collapse results.
The only workaround for this problem I know of is to make sure that
the groups are distributed evenly between shards and that a single
group's documents are not spread across shards.

Other than that, there are no further major issues with this patch.
Many people are using this patch in their Solr setups, but it is a
patch, so you'll have to keep that in mind. There are efforts to put
grouping functionality into Solr itself (without patching) in SOLR-236's
child issues, so keep an eye on those.

Cheers,

Martijn

On 3 July 2010 19:20, osocurious2  wrote:
>
> 
>
> I wanted to extend my question some. My original question about collapsing
> null fields is still open, but in trying to research elsewhere I see a lot
> of angst about the Field Collapse functionality in general. Can anyone
> summarize what the current state of affairs is with it? I'm on Solr 1.4,
> just the latest release build, not any current builds. Field Collapse seems
> to be in my build because I could do single field collapse just fine (hence
> my null field question). However there seems to be talk of problems with
> Field Collapse that aren't fixed yet. What kinds of issues are people
> having? Should I avoid Field Collapse in a production app for now? (tricky
> because I'm merging my schema with a third party tool schema and they are
> using Field Collapse).
>
> Any insight would be helpful, thanks
> Ken
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Field-Collapse-question-tp939118p940923.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query modification

2010-07-04 Thread Chris Hostetter

: So QueryComponent is the place to do this? Are query analyzers already done?

I would actually suggest doing it in a QParserPlugin; that way it can be 
reused in multiple parsing situations, and the stock search behavior of 
QueryComponent (including distributed search) can function as-is.

: Would I have access to stems, synonyms, tokens, etc of the query?

those concepts only make sense in the context of analysis -- in either the 
QueryComponent or a QParser you have access to the query string and you 
can do whatever analysis you want on it (with the full IndexSchema at your 
disposal)
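
For illustration, a bare-bones QParserPlugin along those lines might look
like the following (the class name and the rewrite step are made up; the
rest is my recollection of the stock Solr 1.4 plugin API):

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class RewritingQParserPlugin extends QParserPlugin {
  public void init(NamedList args) { }

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      public Query parse() throws ParseException {
        // rewrite the raw query string here (stems, synonyms, ...);
        // the IndexSchema is available through the request for analysis
        String rewritten = rewrite(getString());
        // then delegate the rewritten string to the stock "lucene" parser
        return subQuery(rewritten, "lucene").getQuery();
      }
    };
  }

  private String rewrite(String q) {
    return q; // placeholder for the actual analysis logic
  }
}

It would be registered in solrconfig.xml with something like
<queryParser name="rewriting" class="com.example.RewritingQParserPlugin"/>
and invoked per request as q={!rewriting}your query.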



-Hoss



Re: Bizarre Terms revisited

2010-07-04 Thread Chris Hostetter

: Subject: Bizarre Terms revisited

to clarify for folks: this is a followup to this previous thread...

http://search.lucidimagination.com/search/document/c0aabe47aad1ca3c/bizarre_tfv_output#de3abb42754407d6

: Using MLT, I get terms that appear to be long concatenations of words
: that are space delimited in the original text.
: I can't think of any reason for these sentence-like terms to exist  (see
: below).

You never answered the questions I asked in the previous thread...

http://search.lucidimagination.com/search/document/c0aabe47aad1ca3c/bizarre_tfv_output#ed49cebdd92db674
>> Did you try pasting that text into the analysis page to see exactly what 
>> your "text_t" field does with it at analysis time, like I suggested?
>> 
>> My best hunch is that the "spaces" are not your typical basic "space" 
>> character (hex 20) and maybe the tokenizer you are using doesn't 
>> tokenize on them, but then perhaps something like word delimiter treats 
>> them as non-word characters and chews them up.
...
>> (Tip: if you use the JSON response writer (wt=json) when looking at the 
>> stored field value, it will help you see exactly what characters were in 
>> the original values by showing you the unicode escapes)

FWIW: I cut/pasted the text you provided...

: Original text (partially snipped) as it appears in the stored index.
: 
: "Ontreweb Product Features 
: 
:      
: 
: Unlimited mutliword and phrase matching Multiple inheritance of concepts 
Pluggable vocabularies, ontologies Multilingual 
: lexicons: french, english, etc. Search in one language, find results in 
another 200,000+ words and phrases, 35,000 mapped 
: concepts.
: 
: 1. 2. 3. 4."

...into the example/exampledocs/solr.xml file in Solr 1.4.1, using the 
field name "attr_darren" (which uses the "textgen" field type with an 
analysis chain matching what you included in your last mail).  When I 
indexed that doc (and nothing else) and looked at the list of terms indexed 
in that field using the LukeRequestHandler, I got the output below.

In short, I can't reproduce what you are describing at all .. and my best 
guess at what you are seeing (barring the possibility that this is old data 
from when the field type was something else) is that what you think is 
whitespace isn't actually a space character that the WhitespaceTokenizer 
recognizes.

Look at the JSON output from your actual stored value, and verify that 
it's not some funky UTF-8 character.
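
For example, a request like this (core URL and document id are illustrative)
prints the stored value with unicode escapes:

http://localhost:8983/solr/select?q=id:YOUR_DOC_ID&fl=attr_darren&wt=json&indent=on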


http://localhost:8983/solr/admin/luke?fl=attr_darren&numTerms=1000

(terms listing elided: each individual whitespace-separated word from the
text appears as its own term, every one with a frequency of 1 -- there are
no concatenated, sentence-like terms)









-Hoss


Re: FastVectorHighlighter and SynonymFilter

2010-07-04 Thread Koji Sekiguchi
(10/07/04 23:55), ito hayato wrote:
> Hi all,
>
> I tried using SynonymFilter and
> FastVectorHighlighter together,
> and got an empty highlight response.
> To investigate the cause, I tested every combination of
> hl.useFastVectorHighlighter and the expand attribute.
>
> ---
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>         ignoreCase="true" expand="false"
>         tokenizerFactory="solr.CJKTokenizerFactory"/>
> ---
>
> Result:
>
>  expand="true" in schema.xml, and
> hl.useFastVectorHighlighter=true
>  -> The returned highlight section is empty, as below.
>
> ---
> <lst name="highlighting">
>   <lst name="..."/>
> </lst>
> ---
>
>  expand="false" , and hl.useFastVectorHighlighter=true
>  expand="true"  , and hl.useFastVectorHighlighter=false
>  expand="false" , and hl.useFastVectorHighlighter=false
>
>  -> In these cases, highlighting returns the correct snippets.
>
>
> Is using SynonymFilter and FastVectorHighlighter
> together not supported?
> Are these components incompatible?
>
> Additional notes:
>  - the target field type is defined as follows
>  - the target field is tokenized by CJKTokenizer
>  - the problem occurs only when I search with a Japanese keyword,
>    but not with an English keyword
>    (is the trouble caused by the n-gram tokenization?)
>
> ---
> <fieldType name="..." class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="solr.CJKTokenizerFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.CJKTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"
>             tokenizerFactory="solr.CJKTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> ---
>
> --
> 2010 FIFA World Cup News [Yahoo!Sports/sportsnavi]
> http://pr.mail.yahoo.co.jp/southafrica2010/
>
>   
Hello Ito-san,

I think the cause of the problem is that the combination of query-time
expansion and an N-gram tokenizer generates a MultiPhraseQuery;
however, FVH doesn't support MultiPhraseQuery.

Koji

-- 
http://www.rondhuit.com/en/



Re: how to apply stemming to the index ?

2010-07-04 Thread Erick Erickson
I'm a little confused about what you're trying to accomplish where.
The fact that you posted to the SOLR users list would indicate
you're using SOLR, in which case all you have to do is apply
the stemming in your config file. Something like:

<filter class="solr.PorterStemFilterFactory"/>

in your schema.xml file for your index AND search analyzers.

If you're in Lucene, you can add PorterStemFilter to a filter chain
when making your own analyzer (see the synonym example in
Lucene In Action, first or second edition).
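
For instance, a minimal stemming analyzer along those lines (Lucene
2.9/3.0-era API; the class name is made up) could look like:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class SimpleStemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // lower-case tokenize, then reduce each token to its Porter stem
    return new PorterStemFilter(new LowerCaseTokenizer(reader));
  }
}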

If this is gibberish, perhaps you could provide some more context
for what you're trying to accomplish.

HTH
Erick

On Fri, Jul 2, 2010 at 5:08 AM, sarfaraz masood <
sarfarazmasood2...@yahoo.com> wrote:

>
> I want to stem the terms in my index, but currently I am using the standard
> analyzer, which does not perform any kind of stemming.
>
> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
>
>
> After some searching I found code for a PorterStemAnalyzer, but it is
> having some problems
>
>
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.StopFilter;
> import org.apache.lucene.analysis.LowerCaseTokenizer;
> import org.apache.lucene.analysis.PorterStemFilter;
>
> import java.io.Reader;
> import java.util.Hashtable;
>
>
>  // PorterStemAnalyzer processes input
>  // text by stemming English words to their roots.
>  // This Analyzer also converts the input to lower case
>  // and removes stop words.  A small set of default stop
>  // words is defined in the STOP_WORDS
>  // array, but a caller can specify an alternative set
>  // of stop words by calling non-default constructor.
>
>
> public class PorterStemAnalyzer extends Analyzer
> {
> private static Hashtable _stopTable;
>
>
>  // An array containing some common English words
>  // that are usually not useful for searching.
>
> public static final String[] STOP_WORDS =
> {
> "0", "1", "2", "3", "4", "5", "6", "7", "8",
> "9", "000", "$",
> "about", "after", "all", "also", "an", "and",
> "another", "any", "are", "as", "at", "be",
> "because", "been", "before", "being", "between",
> "both", "but", "by", "came", "can", "come",
> "could", "did", "do", "does", "each", "else",
> "for", "from", "get", "got", "has", "had",
> "he", "have", "her", "here", "him", "himself",
> "his", "how","if", "in", "into", "is", "it",
> "its", "just", "like", "make", "many", "me",
> "might", "more", "most", "much", "must", "my",
> "never", "now", "of", "on", "only", "or",
> "other", "our", "out", "over", "re", "said",
> "same", "see", "should", "since", "so", "some",
> "still", "such", "take", "than", "that", "the",
> "their", "them", "then", "there", "these",
> "they", "this", "those", "through", "to", "too",
> "under", "up", "use", "very", "want", "was",
> "way", "we", "well", "were", "what", "when",
> "where", "which", "while", "who", "will",
> "with", "would", "you", "your",
> "a", "b", "c", "d", "e", "f", "g", "h", "i",
> "j", "k", "l", "m", "n", "o", "p", "q", "r",
> "s", "t", "u", "v", "w", "x", "y", "z"
> };
>
>
>  // Builds an analyzer.
>
> public PorterStemAnalyzer()
> {
> this(STOP_WORDS);
> }
>
>   //Builds an analyzer with the given stop words.
>
>  //@param stopWords a String array of stop words
>
> public PorterStemAnalyzer(String[] stopWords)
> {
> _stopTable = StopFilter.makeStopTable(stopWords);
> }
>
>
>  // Processes the input by first converting it to
>  // lower case, then by eliminating stop words, and
>  // finally by performing Porter stemming on it.
>  //
>  // @param reader the Reader that
>  //   provides access to the input text
>  // @return an instance of TokenStream
>
> public final TokenStream tokenStream(Reader reader)
> {
> return new PorterStemFilter(
> new StopFilter(new LowerCaseTokenizer(reader),
> _stopTable));
> }
> }
>
> *Errors marked in bold.
>
>
> Please let me know if there is some alternate way to apply stemming to the
> index if this is not the right approach.
>
>
> -Sarfaraz
>
>
>
>


Re: steps to improve search

2010-07-04 Thread Erick Erickson
Yes, when you change the schema in the indexing portion,
it is necessary to reindex the data. You can change the
search parts w/o reindexing.

Also, see this page:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
See the CommonGramsFilterFactory section, which explains how common
words (such as stop words) can be combined with adjacent tokens into
shingles, greatly speeding up phrase queries that contain very common terms.
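
A typical index-side use in schema.xml might look like this (a sketch; the
words file is whatever stopword list you already use, and the query analyzer
gets the matching solr.CommonGramsQueryFilterFactory):

<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>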

HTH
Erick

On Fri, Jul 2, 2010 at 11:38 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Thanks Leonardo, I didn't know that tool, very good!
>
> So I see what is wrong:
>
> SnowballPorterFilterFactory and StopFilterFactory. (both used on index and
> query)
>
> I tried removing the snowball filter and changing the stop filter to
> "ignoreCase=false" on QUERY, and restarted Solr.
>
> But now I get no results :(.
>
> On index analysis I get (result of filters):
> paying  for it
> paying
> paying
> paying
> pay
>
> For Query analysis (result of filters):
> paying  for it
> paying  for it
> paying
> paying
> paying
>
> This means that, at the end, the word indexed is "pay" and the word
> searched is "paying"?
>
> It's necessary to reindex the data?
>
> Thanks
>
> -Original Message-
> From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com]
> Sent: sexta-feira, 2 de Julho de 2010 12:58
> To: solr-user@lucene.apache.org
> Subject: Re: steps to improve search
>
> most likely due to:
> EnglishPorterFilterFactory
> RemoveDuplicatesTokenFilterFactory
> StopFilterFactory
>
> you get those "fake" matches. Try going into the admin, in the analysis
> section. In there you can "simulate" the indexing/searching of a document,
> and see how it's actually searched/indexed. It will give you some clues...
>
> On Fri, Jul 2, 2010 at 1:50 PM, Frederico Azeiteiro <
> frederico.azeite...@cision.com> wrote:
>
> > For the example given, I need the full expression "paying for it", so
> > yes all the words.
> > -Original Message-
> > From: Ahmet Arslan [mailto:iori...@yahoo.com]
> > Sent: sexta-feira, 2 de Julho de 2010 12:30
> > To: solr-user@lucene.apache.org
> > Subject: RE: steps to improve search
> >
> > > I need to know how to achieve more accurate queries (like
> > > the example below...) using these filters.
> >
> > do you want all of the terms you search for to have to appear in returned
> > documents?
> >
> > You can change the default operator of the QueryParser to AND, either in
> > schema.xml or by appending &q.op=AND to your search URL. I am assuming you
> > are not using dismax.
> >
> >
> >
> >
>


Re: Boosting based on field values

2010-07-04 Thread Erick Erickson
Wouldn't sorting work in this situation as well?
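
For instance, with the field from the original question (assuming it is
indexed and single-valued so it can be sorted on):

&q=yourQuery&sort=isFeatured desc,score desc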

Erick

On Sun, Jul 4, 2010 at 8:54 AM, Indika Tantrigoda wrote:

> Hi,
>
> {!boost b=pow(1,featured_listing)} is the boost function I used.
>
> Got the results as expected.
>
> Thanks.
>
> Regards,
> Indika
>
>
> On 3 July 2010 21:10, Indika Tantrigoda  wrote:
>
> > Thanks for the info. I'll try this out.
> >
> > Regards,
> > Indika
> >
> >
> > On 3 July 2010 20:48, Ahmet Arslan  wrote:
> >
> >> > I'd like to know if it's possible to boost the score of
> >> > documents based on a
> >> > field value. Ex. My schema has a field called isFeatured, and if the
> >> > value of the field
> >> > is true or "1" I'd like to
> >> > have these documents come first in a query result
> >> > regardless of the score.
> >>
> >> Yes it is possible. http://wiki.apache.org/solr/FunctionQuery
> >>
> >>
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> >>
> >> Something like:
> >> &q={!boost b=pow(x,abs(sub(isFeatured,1)),0.5)}yourQuery
> >>
> >>
> >>
> >>
> >>
> >
>


Re: What is the proper procedure to reopen closed bugs?

2010-07-04 Thread Chris Hostetter

: I'd like to reopen a bug SOLR-1960
: https://issues.apache.org/jira/browse/SOLR-1960
: "http://wiki.apache.org/solr/ : non-English users get generic MoinMoin page 
instead of the desired information"

It looks like Koji has already helped you out with SOLR-1960, and that 
issue is obviously a special case for dealing with the wiki, but in 
general reopening a "closed" bug is almost never the correct thing to do 
-- closed is normally used to indicate that a "fix" has been included in a 
release, so if the bug has re-appeared, you want to open a new bug (linked 
to the old one) so you can track it distinctly - otherwise two different 
releases would both indicate in their CHANGES.txt file that they "fix" the 
same issue.



-Hoss



Re: Solrj Question

2010-07-04 Thread Chris Hostetter

: Subject: Solrj Question
: In-Reply-To: <36b99395-9f0b-45ad-ac05-1d2415833...@yahoo.com>
: References: <36b99395-9f0b-45ad-ac05-1d2415833...@yahoo.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss