expand synonyms without tokenizing stream?

Don Clore Wed, 08 Jul 2009 10:09:45 -0700

I'm pretty new to Solr, so my apologies if this is a naive question, and
for the verbosity.
I'd like to take keywords in my documents, and expand them as synonyms; for
example, if the document gets annotated with a keyword of 'sf', I'd like
that to expand to 'San Francisco'.  (San Francisco,San Fran,SF is a line in
my synonyms.txt file).


But I also want to be able to display facets with counts for these keywords;
I'd like them to be suitable for display.

So, if I define the keywords field as 'text', I use the following pipeline
(from my schema.xml):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Faceting on this field, I get the following values back (when I query
specifically for the single document in question):

      <lst name="Keywords">
        <int name="fran">1</int>
        <int name="francisco">1</int>
        <int name="san">1</int>
        <int name="sf">1</int>
      </lst>

I've also set up a copyField to a 'KeywordsString' field, which is
defined as "string", i.e.:

<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>

Faceting on *that* field (when querying for just this 1 document,
which has a keyword of 'sf'), results in:

      <lst name="KeywordsString">
        <int name="sf">1</int>
      </lst>
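
My understanding is that copyField hands the destination field the raw
input value, before any analysis runs, which would explain why only the
literal 'sf' shows up here. The declarations look roughly like the sketch
below. The field names are taken from the facet output above, but the
indexed/stored/multiValued values are assumptions for illustration, not a
copy of my actual schema.

```xml
<!-- Sketch only: attribute values are assumed, not copied from the real schema. -->
<field name="Keywords" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="KeywordsString" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- copyField passes the original, un-analyzed input to the destination field. -->
<copyField source="Keywords" dest="KeywordsString"/>
```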

I guess what I'd like is to be able to stamp documents with keywords like
'sf', 'san fran', 'san francisco', and 'mlb' (with a synonyms.txt file
entry of mlb => Major League Baseball), and have all the documents that
are tagged with any of those synonym variants come back as:

      <lst name="KeywordsString">
        <int name="San Francisco">1</int>
        <int name="Major League Baseball">1</int>
      </lst>


But I don't know how to define a processing pipeline that expands
synonyms without tokenizing them, i.e. without breaking 'San Francisco'
into 'san' and 'francisco' and presenting those as separate facets.
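
To make the question concrete, this is the kind of field type I have in
mind, as an untested sketch. The name 'keywordSynonym' is made up. It
assumes that solr.KeywordTokenizerFactory keeps the whole keyword as a
single token, and that the tokenizerFactory attribute on
SynonymFilterFactory (which I believe controls how the entries in
synonyms.txt are parsed) is available in my Solr version, so that
multi-word entries like 'San Francisco' are not split on whitespace.

```xml
<!-- Untested sketch; "keywordSynonym" is a hypothetical name. -->
<fieldType name="keywordSynonym" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emit the entire field value as one token, so 'San Francisco' is not split. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- expand="false" should map every variant on a comma-separated line to the
         first entry; tokenizerFactory is assumed to make the synonyms file itself
         be parsed as whole phrases rather than whitespace-separated words. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With the line San Francisco,San Fran,SF in synonyms.txt, I'd hope a keyword
of 'sf' gets indexed as the single term 'San Francisco', so that I could
facet on a field of this type instead of KeywordsString. Whether the
display casing survives probably depends on how ignoreCase treats the
replacement text in a given Solr version.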

Thanks for any help,

Don
