Re: Dilemma - Very Frequent Synonym updates for Huge Index

Jan Høydahl / Cominvent Thu, 01 Jul 2010 16:08:54 -0700

Hi,

Another, more complex approach is to design a routine that periodically 
selects which documents to reindex by querying for the newly added synonym 
entries, and refeeds only those documents with the new index-side dictionary 
in place. Could work well.
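
A minimal sketch of the selection step, under some assumptions: the synonym
files use the comma-separated one-group-per-line format shown further down in
this thread, the entity field is called `keywords` (a hypothetical name), and
the function names are made up for illustration. It only builds the query
string; running it against Solr and refeeding the hits is left out.

```python
def parse_synonym_lines(lines):
    """Return a set of frozensets, one per comma-separated synonym group."""
    groups = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        terms = frozenset(t.strip() for t in line.split(",") if t.strip())
        if terms:
            groups.add(terms)
    return groups


def terms_needing_reindex(old_lines, new_lines):
    """Collect every term from groups that are new or changed.

    A changed group counts as a whole, since with expand="true" every
    document carrying any term of the group would be indexed differently.
    """
    changed = parse_synonym_lines(new_lines) - parse_synonym_lines(old_lines)
    return sorted({term for group in changed for term in group})


def build_reindex_query(field, terms):
    """Build a query string matching documents that carry any of the terms.

    `field` is an assumption; use whatever field holds the extracted entities.
    """
    quoted = ['"%s"' % t.replace("\\", "\\\\").replace('"', '\\"') for t in terms]
    return "%s:(%s)" % (field, " OR ".join(quoted))


if __name__ == "__main__":
    old = ["DHS,D.H.S.,Department of Homeland Security"]
    new = old + ["Virginia, Va., VA"]
    print(build_reindex_query("keywords", terms_needing_reindex(old, new)))
```

The query should be run while the new entries are still only in the query-side
dictionary, so that it matches documents indexed under any of the variants.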


I would consider an architecture where your indexers only do indexing (except 
in a disaster, when they can serve searches as well) - in that case you can 
happily reindex without worrying about affecting user experience.

What exactly is the issue you see with the query-side-only synonym expansion 
when using KeywordTokenizer?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1. juli 2010, at 20.06, Ravi Kiran wrote:

> Hello Mr. Høydahl,
> 
> I thought of doing it exactly as you have said. I shall try it out and see
> where I land. However, I am still skeptical about that approach from the
> performance point of view: we are a round-the-clock news organization, and
> huge reindexing might affect the speed of searches. Moreover, in the news
> business "being first" is more important, hence we need those synonyms to
> take effect right away, and that's where we are in a quandary.
> 
>   With regards to the OpenNLP implementation, our design is plain vanilla
> outside of Solr. We generate the XML on the fly with extracted entities from
> OpenNLP and then index it straight into Solr. However, we do some sanity
> checks for locations prior to indexing using WordNet so that false positives
> are avoided in location names.
> 
> Thanks,
> 
> Ravi Kiran Bhaskar
> 
> On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
> jan....@cominvent.com> wrote:
> 
>> Hi,
>> 
>> I think I would look at a hybrid approach, where you keep adding new
>> synonyms to a query-side synonym dictionary for immediate effect. Then,
>> every now and then or every Nth night, you move those synonyms over to the
>> index-side dictionary and trigger a full reindex.
>> 
>> A nice side effect of reindexing now and then could be that if your OpenNLP
>> extraction dictionaries have changed, it will be reflected too.
>> 
>> BTW: Could you share details of your OpenNLP integration with us? I'm about
>> to do it on another project.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>> 
>> On 1. juli 2010, at 06.57, Ravi Kiran wrote:
>> 
>>> Hello,
>>> 
>>> Hoping some Solr guru can help me out here. We are a news organization
>>> trying to migrate 10 million documents from FAST to Solr. The plan is to
>>> have our Editorial team add/modify synonyms multiple times a day as they
>>> deem appropriate. Hence we plan on using query-time synonyms, as we cannot
>>> reindex every time they modify the synonyms file (for the entities
>>> extracted by OpenNLP, like locations/organizations/person names from the
>>> article body). Since the synonyms are for names, I am concerned that the
>>> multi-phrase issue crops up with the query-time synonyms. For example,
>>> the synonyms could be as follows:
>>> 
>>> The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
>>> DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
>>> USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
>>> 
>>> Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
>>> Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
>>> William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
>>> 
>>> Virginia, Va., VA
>>> D.C,Washington D.C, District of Columbia
>>> 
>>> I have the following fieldType in schema.xml for the keywords/entities.
>>> What issues should I be aware of? And is there a better way to achieve it
>>> without having to reindex a million docs on each synonym change? NOTE that
>>> I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
>>> SynonymFilterFactory to keep the words intact without splitting.
>>> 
>>>   <!--  Field Type Keywords/Entities Extracted from OpenNLP -->
>>>   <fieldType name="keywordText" class="solr.TextField"
>>>              sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>       <filter class="solr.TrimFilterFactory"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>               words="stopwords.txt,entity-stopwords.txt"
>>>               enablePositionIncrements="true"/>
>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>       <filter class="solr.TrimFilterFactory"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>               words="stopwords.txt,entity-stopwords.txt"
>>>               enablePositionIncrements="true"/>
>>>       <filter class="solr.SynonymFilterFactory"
>>>               tokenizerFactory="solr.KeywordTokenizerFactory"
>>>               synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
>>>               ignoreCase="true" expand="true"/>
>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>     </analyzer>
>>>   </fieldType>
>> 
>>
