Dilemma - Very Frequent Synonym updates for Huge Index

Ravi Kiran Wed, 30 Jun 2010 21:59:00 -0700

Hello,
        Hoping some Solr guru can help me out here. We are a news
organization trying to migrate 10 million documents from FAST to Solr. The
plan is to have our editorial team add or modify synonyms multiple times a
day as they see fit. Hence we plan on using query-time synonyms, since we
cannot reindex every time they modify the synonyms files (for the
entities extracted by OpenNLP, such as locations, organizations and person
names, from the article body). Since the synonyms are for names, I am
concerned that the multi-word synonym issue will crop up with query-time
synonyms. For example, the synonyms could be as follows:


The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.

Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton

Virginia, Va., VA
D.C,Washington D.C, District of Columbia

I have the following fieldType in schema.xml for the keywords/entities. What
issues should I be aware of? And is there a better way to achieve this
without having to reindex a million docs on each synonym change? NOTE that I
use tokenizerFactory="solr.KeywordTokenizerFactory" for the
SynonymFilterFactory to keep multi-word entries intact without splitting them.

    <!--  Field Type Keywords/Entities Extracted from OpenNLP -->
    <fieldType name="keywordText" class="solr.TextField"
               sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory"
                tokenizerFactory="solr.KeywordTokenizerFactory"
                synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
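
To make the concern concrete, here is a rough Python sketch (not Solr code) of how I understand the query analyzer above to behave. The helpers load_synonyms and expand are hypothetical names of my own. The sketch assumes that expand="true" maps every entry on a line to the whole group, that ignoreCase="true" only affects matching, and that each comma-separated entry is kept as a single token because of the KeywordTokenizerFactory. It also assumes, as far as I understand it, that the query parser may split an unquoted query on whitespace before the analyzer ever sees it.

```python
def load_synonyms(lines, ignore_case=True):
    """Map every entry of a comma-separated line to its whole group.

    Each entry is kept as one token (keyword-style), so multi-word
    names are never split on whitespace here.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [e.strip() for e in line.split(",") if e.strip()]
        for entry in group:
            key = entry.lower() if ignore_case else entry
            bucket = table.setdefault(key, [])
            for other in group:
                if other not in bucket:
                    bucket.append(other)
    return table


def expand(term, table, ignore_case=True):
    """Return the synonym group for a whole term, or the term itself."""
    term = term.strip()
    key = term.lower() if ignore_case else term
    return table.get(key, [term])


SAMPLE = [
    "DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security",
    "Virginia, Va., VA",
]
table = load_synonyms(SAMPLE)

# The whole phrase reaches the analyzer as one token: the group is found.
whole = expand("homeland security", table)

# The phrase is split on whitespace first: neither word is a known entry.
split = [expand(word, table) for word in "homeland security".split()]

print(whole)
print(split)
```

If that reading is right, the synonyms only apply when the complete entity name arrives as one token (for example as a quoted phrase), which is exactly the multi-word case I am worried about.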
