Hello Mr. Arslan,

Thank you for responding so promptly. This solution is for topic search: each topic aggregates all content related to it (articles, photos, videos, etc.). At any point in time the user searches for exactly one topic, for example: Barack Obama / Oracle Corp. / Iraq / Gulf Oil Spill. The user is never allowed to do a natural search that combines multiple disparate keywords or entities, such as "Barack Obama Gulf oil Spill". Bottom line: it is entity search.

If that did not make sense, take a look at what the New York Times does at the URL below; that is exactly what I am trying to do:
http://topics.nytimes.com/topics/reference/timestopics/all/b/index.html

Thanks,
Ravi Kiran Bhaskar

On Thu, Jul 1, 2010 at 7:04 AM, Ahmet Arslan <iori...@yahoo.com> wrote:

>
> --- On Thu, 7/1/10, Ravi Kiran <ravi.bhas...@gmail.com> wrote:
>
> > From: Ravi Kiran <ravi.bhas...@gmail.com>
> > Subject: Dilemma - Very Frequent Synonym updates for Huge Index
> > To: solr-user@lucene.apache.org
> > Date: Thursday, July 1, 2010, 7:57 AM
> >
> > Hello,
> > Hoping some solr guru can help me out here. We are a news
> > organization trying to migrate 10 million documents from FAST to solr.
> > The plan is to have our Editorial team add/modify synonyms multiple
> > times during a day as they deem appropriate. Hence we plan on using
> > query time synonyms as we cannot reindex every time they modify the
> > synonyms file (for the entities extracted by OpenNLP like
> > locations/organizations/person names from article body). Since the
> > synonyms are for names, I am concerned that the multi-phrase issue
> > crops up with the query-time synonyms. For example, synonyms could be
> > as follows:
> >
> > The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
> > DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> > USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> >
> > Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> > Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
> > William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
> >
> > Virginia, Va., VA
> > D.C,Washington D.C, District of Columbia
> >
> > I have the following fieldType in schema.xml for the
> > keywords/entities. What issues should I be aware of? And is there a
> > better way to achieve it without having to reindex a million docs on
> > each synonym change?
> > NOTE that I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> > SynonymFilterFactory to keep the words intact without splitting.
> >
> > <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
> > <fieldType name="keywordText" class="solr.TextField"
> >            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.TrimFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt,entity-stopwords.txt"
> >             enablePositionIncrements="true"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.TrimFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt,entity-stopwords.txt"
> >             enablePositionIncrements="true"/>
> >     <filter class="solr.SynonymFilterFactory"
> >             tokenizerFactory="solr.KeywordTokenizerFactory"
> >             synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> >             ignoreCase="true" expand="true"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
>
> Have you ever used this fieldType? Searching on this field will be
> troublesome. You need to search for exactly the same entries as in your
> synonym.txt. Additionally, you need to use the raw or field query parser,
> because the query text is split at whitespace before it reaches the
> KeywordTokenizer.
>
> For example: q=keywordText:(Washington Post Bill Clinton)&debugQuery=on
>
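
To make the point about the field query parser concrete, here is a minimal sketch of how a client might build such a request. It assumes Solr's `{!field f=...}` local-params syntax, which hands the whole query string to the field's query analyzer instead of splitting it on whitespace first. The endpoint URL and the helper name `entity_query_url` are hypothetical and only for illustration; adjust them to the actual setup.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr endpoint (an assumption, not taken from the thread).
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def entity_query_url(entity, field="keywordText", base_url=SOLR_SELECT_URL):
    """Build a select URL that sends one whole entity string to one field.

    The {!field f=...} prefix selects Solr's field query parser, so the
    text reaches the field's query analyzer (KeywordTokenizer, synonyms)
    as a single string rather than as whitespace-separated pieces.
    """
    params = {
        "q": "{!field f=%s}%s" % (field, entity.strip()),
        "debugQuery": "on",  # same debugging flag as in the example above
    }
    # urlencode percent-encodes the braces, '!' and '=' and turns spaces into '+'.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = entity_query_url("Washington Post")
    print(url)
    # Decode the query string again to show the q parameter Solr would receive.
    print(parse_qs(urlsplit(url).query)["q"][0])
```

With debugQuery=on, the parsed query in the response should show whether "Washington Post" was kept as one token and expanded through the synonym files. Swapping `{!field ...}` for `{!raw ...}` would skip analysis entirely, so the query text would then have to match the indexed term exactly.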