Hi everyone!

I've run into a problem, and after searching I still haven't found a
solution, so I'm rather confused at the moment.

I have an application that stores human-readable texts in my Solr index.
It finds the most relevant terms in each text (I think using term vectors
and facets) and stores those facet terms.

All of this works fine, but now I need the most relevant terms to include
terms of two or more words, like "European Union", which is quite a frequent
term in my system. At the moment, the system still puts "European" and
"Union" into the facets as two separate terms.

So, my questions are:
 - Is it possible to have facets of two or more words?
 - Can I tokenize a phrase into words so that, when the tokenizer comes
across "European Union", it generates a single token "European Union" rather
than two tokens, "European" and "Union"?
 - Can term vectors be used to find the relevancy of multi-word terms like
"European Union"?
 - Can I use SynonymFilterFactory to transform "EU, UE, European
Union, Union Europeene" into "European Union"?
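
To make the last question concrete, this is roughly the synonym setup I have
in mind. It is only a sketch: the rule shown in the comment is my assumption
about what synonyms.txt would need to contain, and I have not verified that it
behaves as I hope. As far as I understand, the replacement "European Union"
would still be emitted as two tokens, so this alone may not give me a single
facet term.

```xml
<!-- Assumed content of synonyms.txt (one rule per line). The explicit "=>"
     form maps every variant on the left to the term on the right:

       EU, UE, European Union, Union Europeene => European Union

     Without "=>", and with expand="false", all variants on a line are
     mapped to the FIRST entry on that line, so the target has to come first:

       European Union, EU, UE, Union Europeene
-->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"/>
```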

At index time I use the following analyzer for the English language:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="blacklist.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" words="en" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>
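
One idea I am wondering about, but have not tested, is a separate field type
that adds a ShingleFilterFactory so that adjacent word pairs are indexed as
single tokens. The field type name "text_en_shingles" below is made up for
illustration, and I have left out my stop word, synonym and stemming filters
to keep the sketch short. I assume it would produce every adjacent pair, not
only meaningful ones like "european union", so the facets could get noisy.

```xml
<!-- Hypothetical field type for faceting on one- and two-word terms. -->
<fieldType name="text_en_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits each pair of adjacent words as one token, e.g. "european union".
         outputUnigrams="true" keeps the single words as tokens as well. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Would faceting on a field like this be a reasonable way to get two-word
facet terms, or is there a better approach?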


Thank you for the help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101.html
Sent from the Solr - User mailing list archive at Nabble.com.
