Re: Weird Facet and KeywordTokenizerFactory Issue

Ravi Kiran Tue, 06 Oct 2009 14:10:17 -0700

Yes, exactly the same.

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano <czamb...@gmail.com> wrote:


> And you had the analyzer for that field set up the same way as shown in
> your previous e-mail when you indexed the data?
>
>
>
>
> On 10/06/2009 03:46 PM, Ravi Kiran wrote:
>
>> I did in fact check it out, and there is no weirdness on the analysis
>> page. See the result below.
>>
>> Index Analyzer
>>
>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.TrimFilterFactory {}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
>> ignoreCase=true, enablePositionIncrements=true}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>> expand=false, ignoreCase=true}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> Query Analyzer
>>
>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.TrimFilterFactory {}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
>> ignoreCase=true, enablePositionIncrements=true}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>> expand=false, ignoreCase=true}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>   term position | term text | term type | source start,end | payload
>>   1             | New York  | word      | 0,8              |
>>
>>
>> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czamb...@gmail.com> wrote:
>>
>>
>>
>>> Have you tried using the Analysis page to see what tokens are generated
>>> for the string "New York"? It could be that one of the token filters is
>>> adding the token 'new' for all strings that start with 'new'.
>>>
>>>
>>> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>>>
>>>
>>>
>>>> Hello All,
>>>>
>>>> I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
>>>> understand why I get them and how to eliminate them? My schema.xml
>>>> snippet is given at the end. I am indexing named entities extracted via
>>>> OpenNLP into Solr. My understanding of KeywordTokenizerFactory is that
>>>> it treats the whole input as a single token. Am I right? For example,
>>>> "New York" will be indexed as 'New York' and will not be split, right?
>>>> However, I see them split up in the facets as follows when running the
>>>> query
>>>>
>>>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>>>>
>>>> But when I search with the standard handler, qt=standard&q=keyword:"New",
>>>> I don't find any doc which has just "New". After digging in a bit, I
>>>> found that if several keywords share a common starting word, that word
>>>> is pulled out as another facet, as in the following. Any help is
>>>> greatly appreciated.
>>>>
>>>> Result
>>>> ------------
>>>> <int name="New">47</int>      -------->   Ghost
>>>> <int name="New Hampshire">7</int>
>>>> <int name="New Jersey">16</int>
>>>> <int name="New Orleans">10</int>
>>>> <int name="New York">147</int>
>>>> <int name="New York City">23</int>
>>>> <int name="New York Giants">8</int>
>>>> <int name="New York Islanders">5</int>
>>>> <int name="New York Mercantile Exchange">6</int>
>>>> <int name="New York Mets">8</int>
>>>> <int name="New York Stock Exchange">10</int>
>>>> <int name="New York Times">8</int>
>>>> <int name="New York University">5</int>
>>>> <int name="New Zealand">7</int>
>>>>
>>>> <int name="Energy">7</int>      -------------->   Ghost
>>>> <int name="Energy Department">5</int>
>>>> <int name="Energy Information Administration">5</int>
>>>>
>>>>
>>>> <int name="Federal">7</int>    -------------->   Ghost
>>>> <int name="Federal Deposit Insurance Corp.">6</int>
>>>> <int name="Federal Reserve">26</int>
>>>> <int name="Federal Reserve Chairman">6</int>
>>>>
>>>> <int name="North">27</int>
>>>> <int name="North Carolina">8</int>
>>>> <int name="North Dakota">7</int>
>>>> <int name="North Korea">12</int>
>>>>
>>>> Schema.xml
>>>> -----------------
>>>>
>>>>     <fieldType name="keywordText" class="solr.TextField"
>>>>                sortMissingLast="true" omitNorms="true"
>>>>                positionIncrementGap="100">
>>>>       <analyzer type="index">
>>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>         <filter class="solr.TrimFilterFactory"/>
>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>                 words="stopwords.txt,entity-stopwords.txt"
>>>>                 enablePositionIncrements="true"/>
>>>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>                 ignoreCase="true" expand="false"/>
>>>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>       </analyzer>
>>>>       <analyzer type="query">
>>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>         <filter class="solr.TrimFilterFactory"/>
>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>                 words="stopwords.txt,entity-stopwords.txt"
>>>>                 enablePositionIncrements="true"/>
>>>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>                 ignoreCase="true" expand="false"/>
>>>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>       </analyzer>
>>>>     </fieldType>
>>>>
>>>>     <field name="person" type="keywordText" indexed="true" stored="true"
>>>>            multiValued="true" termVectors="false" termPositions="false"
>>>>            termOffsets="false"/>
>>>>     <field name="organization" type="keywordText" indexed="true"
>>>>            stored="true" multiValued="true" termVectors="false"
>>>>            termPositions="false" termOffsets="false"/>
>>>>     <field name="location" type="keywordText" indexed="true" stored="true"
>>>>            multiValued="true" termVectors="false" termPositions="false"
>>>>            termOffsets="false"/>
>>>>     <field name="keyword" type="keywordText" indexed="true" stored="true"
>>>>            multiValued="true" termVectors="false" termPositions="false"
>>>>            termOffsets="false"/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
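
The pattern described in the original message (a single word showing up as its own facet when several values start with it) can be checked mechanically. What follows is only a minimal sketch, not something posted in the thread: it assumes the facet counts have already been copied into a Python dict, and the function name `find_prefix_facets` is hypothetical. It flags suspects for a follow-up query such as the q=keyword:"New" check above; it does not explain where the extra terms come from.

```python
from collections import defaultdict


def find_prefix_facets(facet_counts):
    """Return single-word facet values that are also the first word of at
    least one multi-word facet value, mapped to those longer values.

    Hypothetical helper for illustration; it only inspects the facet names.
    """
    by_first_word = defaultdict(list)
    for value in facet_counts:
        words = value.split()
        if len(words) > 1:
            by_first_word[words[0]].append(value)
    return {
        value: sorted(by_first_word[value])
        for value in facet_counts
        if len(value.split()) == 1 and value in by_first_word
    }


# A subset of the facet counts quoted in the original message.
facet_counts = {
    "New": 47, "New Hampshire": 7, "New Jersey": 16, "New York": 147,
    "Energy": 7, "Energy Department": 5,
    "Federal": 7, "Federal Reserve": 26,
    "North": 27, "North Carolina": 8, "North Korea": 12,
}

suspects = find_prefix_facets(facet_counts)
for value, longer in suspects.items():
    print(f"{value} ({facet_counts[value]}): first word of {longer}")
```

Each flagged value can then be searched for on its own to see whether any document really carries it as a complete field value.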
