Re: Weird Facet and KeywordTokenizerFactory Issue

Christian Zambrano Tue, 06 Oct 2009 13:53:22 -0700

And did you have the analyzer for that field set up the same way as shown in your previous e-mail when you indexed the data?



On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did in fact check it out, and there is no weirdness in the analysis page... see
the result below.

Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.TrimFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.TrimFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload:


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <[email protected]> wrote:

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be that one of the token filters is adding the
token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,

I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
understand why I get them and how to eliminate them? My schema.xml snippet is
given at the end. I am indexing named entities extracted via OpenNLP into
Solr. My understanding of KeywordTokenizerFactory is that it treats the whole
input as a single token, am I right? For example, "New York" will be indexed
as 'New York' and will not be split, right? However, I see them split up in
the facets as follows when running the query

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1

but when I search with the standard handler, qt=standard&q=keyword:"New", I
don't find any doc which has just "New". After digging in a bit, I found that
if several keywords have a common starting word, it is being pulled out as
another facet, as in the following. Any help is greatly appreciated.

Result
------------
<int name="New">47</int>      -------->   Ghost
<int name="New Hampshire">7</int>
<int name="New Jersey">16</int>
<int name="New Orleans">10</int>
<int name="New York">147</int>
<int name="New York City">23</int>
<int name="New York Giants">8</int>
<int name="New York Islanders">5</int>
<int name="New York Mercantile Exchange">6</int>
<int name="New York Mets">8</int>
<int name="New York Stock Exchange">10</int>
<int name="New York Times">8</int>
<int name="New York University">5</int>
<int name="New Zealand">7</int>

<int name="Energy">7</int>      -------------->   Ghost
<int name="Energy Department">5</int>
<int name="Energy Information Administration">5</int>


<int name="Federal">7</int>    -------------->   Ghost
<int name="Federal Deposit Insurance Corp.">6</int>
<int name="Federal Reserve">26</int>
<int name="Federal Reserve Chairman">6</int>

<int name="North">27</int>
<int name="North Carolina">8</int>
<int name="North Dakota">7</int>
<int name="North Korea">12</int>
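
For reference, here is a minimal sketch of building the facet request with the facet field spelled out. The URL in the message shows no facet.field, so it presumably comes from the handler defaults. The values facet.field=keyword, q=*:* and rows=0 below are assumptions for illustration, not taken from the original setup, and facet_url is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL copied from the message above.
BASE = "http://localhost:8080/solr-admin/topicscore/select/"

def facet_url(field, base=BASE):
    """Build a facet-only request URL for one field (hypothetical helper)."""
    params = [
        ("q", "*:*"),            # assumption: match all documents
        ("rows", "0"),           # assumption: only the facet counts are wanted
        ("facet", "true"),
        ("facet.field", field),  # assumption: facet on the named field
        ("facet.limit", "-1"),   # -1 = no limit, as in the original query
    ]
    return base + "?" + urlencode(params)

url = facet_url("keyword")
print(url)
```

Naming the field explicitly makes it easier to check whether the ghost terms belong to one field or come from several fields faceted together.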

Schema.xml
-----------------

     <fieldType name="keywordText" class="solr.TextField"
                sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt,entity-stopwords.txt"
                 enablePositionIncrements="true"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                 ignoreCase="true" expand="false"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt,entity-stopwords.txt"
                 enablePositionIncrements="true"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                 ignoreCase="true" expand="false"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>

     <field name="person" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
     <field name="organization" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
     <field name="location" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
     <field name="keyword" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
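
To make the expectation about this field type concrete, here is a toy model in Python. It is not Solr code: keyword_analyze only imitates a keyword tokenizer followed by a trim filter (stop words and synonyms are left out), whitespace_analyze imitates word splitting for comparison, and the sample documents are made up.

```python
from collections import Counter

def keyword_analyze(value):
    """Toy stand-in for KeywordTokenizer + TrimFilter: one token per value."""
    token = value.strip()
    return [token] if token else []

def whitespace_analyze(value):
    """Toy stand-in for a whitespace tokenizer: one token per word."""
    return value.split()

def facet_counts(docs, analyze):
    """Count, per term, the number of documents containing that term."""
    counts = Counter()
    for values in docs:
        terms = set()
        for value in values:  # the field is multiValued
            terms.update(analyze(value))
        counts.update(terms)  # each document counts once per term
    return counts

# Hypothetical documents, each a list of values for the "keyword" field.
docs = [
    ["New York", "Federal Reserve"],
    ["New York City"],
    ["New Jersey", " New York "],
]

kw = facet_counts(docs, keyword_analyze)
ws = facet_counts(docs, whitespace_analyze)
print(kw["New York"], kw["New"])  # 2 0
print(ws["New York"], ws["New"])  # 0 3
```

In this model, whole-value tokenization yields a "New" facet only if some indexed value is exactly "New" after trimming, while word splitting would give "New" a count at least as high as the number of documents with "New York". In the reported listing "New" (47) is lower than "New York" (147), which plain word splitting in this toy model would not produce.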
