Hi,

Another, more complex approach is to design a routine that every once in a while selectively decides which documents to reindex, based on a query on the newly added synonym entries, and refeeds those with the new index-side dictionary in place. Could work well.
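A minimal sketch of that selection step, in Python. It only builds the query string that would find the documents to refeed; running it against Solr and refeeding the hits is left out. It assumes the new entries arrive as plain comma-separated synonym lines (no `=>` mappings) and that the entities live in a field called `keywords`, which is a hypothetical name and not taken from the schema in this thread. The escaping covers only backslashes and double quotes inside a quoted phrase, so treat it as a starting point.

```python
def parse_synonym_line(line):
    """Split one comma-separated synonym line into trimmed, non-empty entries."""
    return [entry.strip() for entry in line.split(",") if entry.strip()]


def escape_phrase(text):
    """Escape backslashes and double quotes so the text can sit inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_reindex_query(new_synonym_lines, field="keywords"):
    """Build one OR query matching documents that contain any newly added entry.

    `field` is a hypothetical field name; replace it with the real entity field.
    """
    entries = []
    for line in new_synonym_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        for entry in parse_synonym_line(stripped):
            if entry not in entries:  # keep first occurrence, preserve order
                entries.append(entry)
    clauses = ['{}:"{}"'.format(field, escape_phrase(e)) for e in entries]
    return " OR ".join(clauses)


if __name__ == "__main__":
    new_lines = [
        "USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.",
    ]
    print(build_reindex_query(new_lines))
```

The resulting string could be used as the `q` parameter of a normal select request; only the matching documents would then need to be refed.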
I would consider an architecture where your indexers only do indexing (except in a disaster, where they can serve searches as well). In that case you can happily reindex without worrying about affecting the user experience.

What exactly is the issue you see with query-side-only synonym expansion when using KeywordTokenizer?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1 July 2010, at 20:06, Ravi Kiran wrote:

> Hello Mr. Høydahl,
> I thought of doing it exactly as you have said.
> I shall try it out and see where I land. However, I am still skeptical about that
> approach from a performance point of view, as we are a round-the-clock news
> organization and a huge reindexing might affect the speed of searches. Moreover,
> in the news business "being first" is more important, hence we need those
> synonyms to take effect right away, and that's where we are in a quandary.
>
> With regards to the OpenNLP implementation, our design is plain vanilla
> outside of Solr. We generate the XML on the fly with extracted entities from
> OpenNLP and then index it straight into Solr. However, we do some sanity
> checks for locations prior to indexing using WordNet so that false positives
> are avoided in location names.
>
> Thanks,
>
> Ravi Kiran Bhaskar
>
> On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
> jan....@cominvent.com> wrote:
>
>> Hi,
>>
>> I think I would look at a hybrid approach, where you keep adding new
>> synonyms to a query-side synonym dictionary for immediate effect. And then
>> every now and then, or every Nth night, you move those synonyms over to the
>> index-side dictionary and trigger a full reindex.
>>
>> A nice side effect of reindexing now and then could be that if your OpenNLP
>> extraction dictionaries have changed, it will be reflected too.
>>
>> BTW: Could you share details of your OpenNLP integration with us? I'm about
>> to do it on another project.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 1 July 2010, at 06:57, Ravi Kiran wrote:
>>
>>> Hello,
>>> Hoping some Solr guru can help me out here. We are a news
>>> organization trying to migrate 10 million documents from FAST to Solr. The
>>> plan is to have our Editorial team add/modify synonyms multiple times during
>>> a day as they deem appropriate. Hence we plan on using query-time synonyms,
>>> as we cannot reindex every time they modify the synonyms file (for the
>>> entities extracted by OpenNLP, like locations/organizations/person names from
>>> the article body). Since the synonyms are for names, I am concerned that the
>>> multi-phrase issue crops up with the query-time synonyms. For example,
>>> synonyms could be as follows:
>>>
>>> The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
>>> DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
>>> USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
>>>
>>> Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
>>> Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
>>> William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
>>>
>>> Virginia, Va., VA
>>> D.C,Washington D.C, District of Columbia
>>>
>>> I have the following fieldType in schema.xml for the keywords/entities. What
>>> issues should I be aware of? And is there a better way to achieve it
>>> without having to reindex a million docs on each synonym change?
>>> NOTE that I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
>>> SynonymFilterFactory to keep the words intact without splitting.
>>>
>>> <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
>>> <fieldType name="keywordText" class="solr.TextField"
>>>            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.TrimFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.TrimFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.SynonymFilterFactory"
>>>             tokenizerFactory="solr.KeywordTokenizerFactory"
>>>             synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
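
To make the whole-value matching discussed in this thread concrete, here is a simplified model in Python of query-side expansion when both the field and the synonym file are handled as single, unsplit tokens. It is a sketch under stated assumptions and not Solr's actual code: it handles only comma-separated groups (no `=>` rules), matches case-insensitively, and returns the entries as written in the file. The function names are made up for illustration.

```python
def load_synonym_groups(lines):
    """Map each entry (lowercased) to every entry on its line, like expand="true".

    Simplified model: comma-separated groups only, '#' comments and blanks skipped.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        group = [entry.strip() for entry in line.split(",") if entry.strip()]
        for entry in group:
            merged = table.setdefault(entry.lower(), [])
            for other in group:
                if other not in merged:  # merge groups that share an entry
                    merged.append(other)
    return table


def expand_query_value(value, table):
    """Expand a whole query value; a partial value such as one word will not match."""
    key = value.strip()
    return table.get(key.lower(), [key])


if __name__ == "__main__":
    synonyms = [
        "The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO",
        "Virginia, Va., VA",
    ]
    table = load_synonym_groups(synonyms)
    print(expand_query_value("the post", table))  # whole entry: expands to the group
    print(expand_query_value("Post", table))      # fragment of an entry: no expansion
```

In this model the multi-word entries cause no trouble as long as the query value arrives as one token, but a query for only part of a name is left unexpanded.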