Re: Facets, termvectors, relevancy and Multi word tokenizing

Ahmet Arslan Thu, 27 Feb 2014 13:54:34 -0800


Hi epnRui,


I don't fully follow your e-mail (I think you need to describe your use case),
but here are some answers:

- Is it possible to have facets of two or more words?

Yes. For example if you use ShingleFilterFactory at index time you will see two 
or more words in facets.
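
As a rough sketch of the shingle approach, a separate field type used only for
faceting could look like the one below. The field type name
text_shingle_facet is made up for illustration, and the shingle sizes are
assumptions to tune; only the filter class and its minShingleSize,
maxShingleSize and outputUnigrams attributes come from Solr itself.

```xml
<!-- Hypothetical field type for faceting on one- and two-word terms -->
<fieldType name="text_shingle_facet" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits single words plus two-word shingles,
         e.g. "european", "union" and "european union" -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Keep in mind that every pair of adjacent words becomes a facet value this way,
so the facet list grows quickly and will contain many pairs that are not
meaningful terms.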


- Can I tokenize a phrase into words, but when it comes across "European
Union", it generates one token for "European Union" and not two tokens
"European" and "Union"?


Yes. For example, you can use MappingCharFilterFactory (char filters are
executed before the tokenizer) with this mapping:
"European Union" => "European_Union"
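
A minimal sketch of how that could be wired into the schema, assuming the
mapping lives in a file called mapping-multiword.txt (a hypothetical name) and
using a made-up field type name:

```xml
<!-- Hypothetical field type; the mapping file name is an assumption -->
<fieldType name="text_en_multiword" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- mapping-multiword.txt contains the line:
         "European Union" => "European_Union" -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-multiword.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The mapping matches the literal characters before any lowercasing happens, so
"european union" in lowercase would need its own rule. It is worth checking on
the Analysis page of the admin UI that your tokenizer keeps "European_Union"
as a single token.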


Regarding the synonym filter, please see:
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
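
For the index-time case you describe, one possible sketch is an explicit
mapping that collapses all variants to a single token. The rule below is an
assumption for illustration, not something taken from your synonyms.txt; it is
written in lowercase because LowerCaseFilterFactory runs before the synonym
filter in your analyzer.

```xml
<!-- synonyms.txt would contain one explicit rule (illustrative only):
     eu, ue, european union, union europeene => european_union -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
```

Mapping to a single token such as european_union avoids having the
replacement split back into two terms in your facets.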

Ahmet


On Thursday, February 27, 2014 1:10 PM, epnRui <rui_banda...@hotmail.com> wrote:
Hi everyone!

I'm having a problem. I have searched and haven't found a solution yet,
and I am rather confused at the moment.

I have an application that stores human readable texts in my Solr index.
It finds the most relevant terms in that human readable text, I think using
termvectors and facets, and it stores the facets terms.

All works fine, but now I need the most relevant terms to also include terms
of two or more words, like "European Union", which is quite a frequent term
in my system. The system still puts "European" and "Union" into the facets
as two separate terms.

So, questions are:
- Is it possible to have facets of two or more words?
- Can I tokenize a phrase into words, but when it comes across "European
Union", it generates one token for "European Union" and not two tokens
"European" and "Union"?
- Can termvectors be used to find relevancy of multi-word terms like
"European Union" ?
- Can I use SynonymFilterFactory that would transform: "EU, UE, European
Union, Union Europeene" into "European Union" ?

At index time I have the following analyzer for the English
language:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="blacklist.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" words="en" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>


Thank you for the help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101.html
Sent from the Solr - User mailing list archive at Nabble.com.
