Re: Dilemma - Very Frequent Synonym updates for Huge Index

Ravi Kiran Sun, 04 Jul 2010 10:53:18 -0700

Hello Mr. Høydahl,
Yes, you are right: we can selectively reindex, which would reduce the
amount of indexing, but not by much for commonly occurring entities. For
example, George W. Bush / Barack Obama / Afghanistan / Iraq etc. occur in
most of the documents from the last 5 years, so a couple of million docs
would be reindexed every time. BTW, my boss has mentioned that I won't be
getting any new server due to budget constraints, so I am stuck with a
single machine to do both reindexing and searches.


With query-side-only synonyms (no index-time synonyms, as facets don't honor
synonyms), the issue would be that all variations of the name will be
displayed, since I use the field as a multiValued facet field and display it.
(Our requirements want only one variation shown, as that makes it easy to use
an alphabetical listing like A, B, C...Z.)

I know it is not the right kind of design, considering that millions of
entities should not be made facets, but my business requirements also state
that an entity is eligible for display only if there are more than 5
occurrences of it... and hence I can use f.keyword.facet.mincount=5
configured in my solrconfig.xml, which is quite easy. That's my motivation
for using facets.
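
A minimal sketch of such a setting, assuming the facet field is named
"keyword" and the defaults live on a standard search handler (both names are
assumptions for illustration; Solr's per-field form is
f.<field>.facet.mincount):

```xml
<!-- Sketch only: the handler name and the field name "keyword" are assumed -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="facet">true</str>
    <str name="facet.field">keyword</str>
    <!-- drop facet values whose count is below 5 -->
    <str name="f.keyword.facet.mincount">5</str>
  </lst>
</requestHandler>
```

Note that mincount=5 keeps values with a count of exactly 5; a strict
"more than 5" rule would need 6.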

Ideally, for my SynonymFilter I want expand="false" at index time (to make
sure only one variant shows in the display) and expand="true" at query time
(so that a newly added synonym works immediately after a core reload). But
MultiPhraseWeight.scorer, a method of an inner class of MultiPhraseQuery,
throws errors, probably because multi-word synonyms are not supported at
query time. I do not know why Solr chose to use WhitespaceTokenizer even when
the tokenizer for the field is explicitly defined in schema.xml (in my case
KeywordTokenizer).
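
The index-time/query-time split described above could be sketched in
schema.xml roughly as follows. It reuses the keywordText field type and one of
the synonym file names from the original post quoted below, and leaves out the
stop filter for brevity. This only illustrates the intended setup; it is not a
verified fix, and the query-time multi-word problem would still apply.

```xml
<!-- Sketch only: names reused from the keywordText type quoted below -->
<fieldType name="keywordText" class="solr.TextField"
           sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- expand="false": each variant is reduced to the first entry on its
         line in the synonyms file, so facets show a single form -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="person-synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- expand="true": a query for any variant is expanded to all variants,
         so entries added after a core reload apply without reindexing -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="person-synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```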

Thanks for your continued interest in answering my questions.

Ravi Kiran Bhaskar


On Thu, Jul 1, 2010 at 7:08 PM, Jan Høydahl / Cominvent <
jan....@cominvent.com> wrote:

> Hi,
>
> Another more complex approach is to design a routine that once in a while
> selectively decides what documents to reindex based on a query on the newly
> added synonym entries, and refeeds those with the new index-side dictionary
> in place. Could work well.
>
> I would consider an architecture where your indexers only do indexing
> (except in a disaster, when they can do search as well). In that case you
> can happily reindex without worrying about affecting user experience.
>
> What exactly is the issue you see with the query-side-only synonym
> expansion when using KeywordTokenizer?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 1. juli 2010, at 20.06, Ravi Kiran wrote:
>
> > Hello Mr. Høydahl,
> > I thought of doing it exactly as you have said; I shall try it out and
> > see where I land. However, I am still skeptical about that approach from
> > the performance point of view, as we are a round-the-clock news
> > organization and huge reindexing might affect the speed of searches.
> > Moreover, in the news business "being first" is more important, hence we
> > need those synonyms to take effect right away, and that's where we are
> > in a quandary.
> >
> >   With regards to the OpenNLP implementation, our design is plain vanilla
> > outside of SOLR. We generate the XML on the fly with entities extracted
> > by OpenNLP and then index it straight into SOLR. However, we do some
> > sanity checks for locations prior to indexing using WordNet, so that
> > false positives are avoided in location names.
> >
> > Thanks,
> >
> > Ravi Kiran Bhaskar
> >
> > On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
> > jan....@cominvent.com> wrote:
> >
> >> Hi,
> >>
> >> I think I would look at a hybrid approach, where you keep adding new
> >> synonyms to a query-side synonym dictionary for immediate effect. And
> >> then every now and then, or every Nth night, you move those synonyms
> >> over to the index-side dictionary and trigger a full reindex.
> >>
> >> A nice side effect of reindexing now and then could be that if your
> >> OpenNLP extraction dictionaries have changed, it will be reflected too.
> >>
> >> BTW: Could you share details of your OpenNLP integration with us? I'm
> >> about to do it on another project.
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >> Training in Europe - www.solrtraining.com
> >>
> >> On 1. juli 2010, at 06.57, Ravi Kiran wrote:
> >>
> >>> Hello,
> >>> Hoping some Solr guru can help me out here. We are a news
> >>> organization trying to migrate 10 million documents from FAST to Solr.
> >>> The plan is to have our Editorial team add/modify synonyms multiple
> >>> times during a day as they deem appropriate. Hence we plan on using
> >>> query-time synonyms, as we cannot reindex every time they modify the
> >>> synonyms file (for the entities extracted by OpenNLP, like
> >>> locations/organizations/person names, from the article body). Since
> >>> the synonyms are for names, I am concerned that the multi-phrase issue
> >>> crops up with the query-time synonyms. For example, synonyms could be
> >>> as follows:
> >>>
> >>> The Washington Post Co., The Washington Post, Washington Post,
> >>> The Post, TWP, WAPO
> >>> DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> >>> USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> >>>
> >>> Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> >>> Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary
> >>> Clinton,Sen. Clinton
> >>> William J. Clinton,William Jefferson Clinton,President Clinton,
> >>> President Bill Clinton
> >>>
> >>> Virginia, Va., VA
> >>> D.C,Washington D.C, District of Columbia
> >>>
> >>> I have the following fieldType in schema.xml for the
> >>> keywords/entities. What issues should I be aware of? And is there a
> >>> better way to achieve it without having to reindex a million docs on
> >>> each synonym change? NOTE that I use
> >>> tokenizerFactory="solr.KeywordTokenizerFactory" for the
> >>> SynonymFilterFactory to keep the words intact without splitting.
> >>>   <!-- Field type for keywords/entities extracted from OpenNLP -->
> >>>   <fieldType name="keywordText" class="solr.TextField"
> >>>              sortMissingLast="true" omitNorms="true"
> >>>              positionIncrementGap="100">
> >>>     <analyzer type="index">
> >>>       <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>       <filter class="solr.TrimFilterFactory"/>
> >>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>               words="stopwords.txt,entity-stopwords.txt"
> >>>               enablePositionIncrements="true"/>
> >>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>     </analyzer>
> >>>     <analyzer type="query">
> >>>       <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>       <filter class="solr.TrimFilterFactory"/>
> >>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>               words="stopwords.txt,entity-stopwords.txt"
> >>>               enablePositionIncrements="true"/>
> >>>       <filter class="solr.SynonymFilterFactory"
> >>>               tokenizerFactory="solr.KeywordTokenizerFactory"
> >>>               synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> >>>               ignoreCase="true" expand="true"/>
> >>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>     </analyzer>
> >>>   </fieldType>
> >>
> >>
>
>
