Re: Weird Facet and KeywordTokenizerFactory Issue

Christian Zambrano Tue, 06 Oct 2009 15:04:18 -0700

Got it. Sorry for not having an answer for your problem.


On 10/06/2009 04:58 PM, Ravi Kiran wrote:

You don't see any facet fields in my query because I have configured them in
solrconfig.xml as defaults for the dismax and standard handlers. That way I
don't have to specify all those fields individually every time I query; all I
need to do is set facet=true.

   <requestHandler name="dismax" class="solr.SearchHandler" default="true">
     <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
         systemid^20.0 headline^20.0 keyword^18.0 person^18.0
organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
multimediablurb^1.5
      </str>
      <str name="pf">
         headline^20.5 keyword^18.5 person^18.5 organization^18.5
usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
      </str>
      <str name="bf">
         recip(rord(pubdatetime),1,1000,1000)^1.0
      </str>
      <str name="fl">
         *
      </str>
      <str name="mm">
         2&lt;-1 5&lt;-3 6&lt;90%
      </str>
      <int name="ps">100</int>
      <str name="q.alt">*:*</str>
      <!-- example highlighter config, enable per-query with hl=true -->
      <str name="hl.fl">keyword</str>
      <!-- for this field, we want no fragmenting, just highlighting -->
      <str name="f.body.hl.fragsize">0</str>
      <!-- instructs Solr to return the field itself if no query terms are
found -->
      <str name="f.name.hl.alternateField">keyword</str>
      <str name="f.text.hl.fragmenter">regex</str>  <!-- defined below -->
      <str name="facet">false</str>
      <int name="facet.mincount">1</int>
      <int name="f.keyword.facet.mincount">5</int>
      <int name="f.keywordlower.facet.mincount">5</int>
      <int name="f.keywordformatted.facet.mincount">5</int>
      <int name="f.person.facet.mincount">5</int>
      <int name="f.personformatted.facet.mincount">5</int>
      <int name="f.organization.facet.mincount">5</int>
      <str name="facet.field">contenttype</str>
      <str name="facet.field">keyword</str>
      <str name="facet.field">keywordlower</str>
      <str name="facet.field">keywordformatted</str>
      <str name="facet.field">person</str>
      <str name="facet.field">personformatted</str>
      <str name="facet.field">organization</str>
      <str name="facet.field">usstate</str>
      <str name="facet.field">country</str>
      <str name="facet.field">subject</str>
     </lst>
   </requestHandler>


On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano <czamb...@gmail.com> wrote:

I am stumped then. I had a similar issue when I was using a field that was
being heavily tokenized, but I corrected the issue by using a separate
field (generated using copyField) that doesn't get analyzed at all.
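
A minimal sketch of that kind of workaround, for a schema.xml like the one
quoted further down in this thread. The field name "keyword_exact" is
hypothetical and does not come from the thread; solr.StrField and copyField
are standard Solr schema features, and the index has to be rebuilt after the
change.

     <!-- Sketch only: "keyword_exact" is a hypothetical name. -->
     <!-- solr.StrField indexes each value verbatim as one term; no analyzer runs. -->
     <fieldType name="string" class="solr.StrField" sortMissingLast="true"
                omitNorms="true"/>

     <!-- Un-analyzed copy of the analyzed field, intended only for faceting. -->
     <field name="keyword_exact" type="string" indexed="true" stored="false"
            multiValued="true"/>

     <!-- Copy the raw input of "keyword" into the un-analyzed field at index time. -->
     <copyField source="keyword" dest="keyword_exact"/>

Facets would then be requested with facet.field=keyword_exact, while searches
keep using the analyzed "keyword" field.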

In the query you provided before, I didn't see the parameters that tell Solr
which fields it should produce facets for.

Something like:


http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location*




On 10/06/2009 04:09 PM, Ravi Kiran wrote:

Yes, exactly the same.

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano <czamb...@gmail.com> wrote:

And you had the analyzer for that field set up the same way as shown in
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did in fact check it out, and there is no weirdness on the analysis page.
See the result below.

Index Analyzer

  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.TrimFilterFactory {}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)

Query Analyzer

  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.TrimFilterFactory {}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload: (none)


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czamb...@gmail.com> wrote:

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be that one of the token filters is adding
the token 'new' for all strings that start with 'new'.


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,

I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
understand why I get them and how to eliminate them? My schema.xml snippet is
given at the end. I am indexing named entities extracted via OpenNLP into
Solr. My understanding of KeywordTokenizerFactory is that it treats the
entire input as a single token, am I right? For example, "New York" will be
indexed as 'New York' and will not be split, right? However, I see them split
up in the facets as follows when running this query:

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1

But when I search with the standard handler, qt=standard&q=keyword:"New", I
don't find any doc which has just "New". After digging in a bit, I found that
if several keywords have a common starting word, that word is pulled out as
another facet, as in the following. Any help is greatly appreciated.

Result
------------
<int name="New">47</int>        -------->     Ghost
<int name="New Hampshire">7</int>
<int name="New Jersey">16</int>
<int name="New Orleans">10</int>
<int name="New York">147</int>
<int name="New York City">23</int>
<int name="New York Giants">8</int>
<int name="New York Islanders">5</int>
<int name="New York Mercantile Exchange">6</int>
<int name="New York Mets">8</int>
<int name="New York Stock Exchange">10</int>
<int name="New York Times">8</int>
<int name="New York University">5</int>
<int name="New Zealand">7</int>

<int name="Energy">7</int>        -------------->     Ghost
<int name="Energy Department">5</int>
<int name="Energy Information Administration">5</int>


<int name="Federal">7</int>      -------------->     Ghost
<int name="Federal Deposit Insurance Corp.">6</int>
<int name="Federal Reserve">26</int>
<int name="Federal Reserve Chairman">6</int>

<int name="North">27</int>
<int name="North Carolina">8</int>
<int name="North Dakota">7</int>
<int name="North Korea">12</int>

Schema.xml
-----------------

     <fieldType name="keywordText" class="solr.TextField"
                sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt,entity-stopwords.txt"
                 enablePositionIncrements="true"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                 ignoreCase="true" expand="false"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt,entity-stopwords.txt"
                 enablePositionIncrements="true"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                 ignoreCase="true" expand="false"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>

     <field name="person" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
     <field name="organization" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
     <field name="location" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
     <field name="keyword" type="keywordText" indexed="true" stored="true"
            multiValued="true" termVectors="false" termPositions="false"
            termOffsets="false"/>
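
One possible way to check whether the filter chain is responsible for the
extra facet values is to index the same data into a field whose analyzer
contains only the tokenizer, then compare the facets of the two fields. This
is a diagnostic sketch, not a confirmed fix: the names "keywordTextRaw" and
"keyword_raw" are hypothetical, and a full reindex is assumed after changing
the schema.

     <!-- Hypothetical diagnostic type: KeywordTokenizer only, no token filters. -->
     <fieldType name="keywordTextRaw" class="solr.TextField"
                sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/>
       </analyzer>
     </fieldType>

     <!-- Receives the same input as "keyword", but skips the filters. -->
     <field name="keyword_raw" type="keywordTextRaw" indexed="true" stored="false"
            multiValued="true"/>
     <copyField source="keyword" dest="keyword_raw"/>

If a value such as "New" still shows up when faceting on keyword_raw, it was
most likely already present as a separate value in the indexed data rather
than produced by the Trim, Stop, Synonym, or RemoveDuplicates filters.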
