Re: Indexing tweet and searching "@keyword" OR "#keyword"

Mohammad Shariq Wed, 10 Aug 2011 22:38:11 -0700

> Do you really want a search on "ipad" to *fail* to match input of "#ipad"?
> Or vice-versa?

My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad',
but for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.
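
One direction that might satisfy this requirement, as an untested sketch:
tokenize on whitespace so '#' and '@' stay attached, let
WordDelimiterFilterFactory emit both the original token and its word part at
index time (preserveOriginal="1"), and leave the token alone at query time.
The field type name "text_tweet" is made up for illustration, and the
CommonGrams and stemming filters from the original schema are left out for
brevity. The admin/analysis page should be used to confirm which tokens
actually come out.

```xml
<!-- Hypothetical field type; "text_tweet" is an illustrative name only. -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "#ipad" reaches the filters intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Emit the word part ("ipad") and also keep the original ("#ipad"). -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No WordDelimiterFilter here: a query for "#ipad" stays "#ipad". -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions, '#ipad' would be indexed as both '#ipad' and 'ipad',
so q='ipad' should match either form, while q='#ipad' remains a single token
at query time and should match only the hashtag form.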



On 10 August 2011 19:49, Erick Erickson <erickerick...@gmail.com> wrote:

> Please look more carefully at the documentation for WDDF,
> specifically:
>
> split on intra-word delimiters (all non alpha-numeric characters).
>
> WordDelimiterFilterFactory will always throw away non alpha-numeric
> characters, you can't tell it to do otherwise. Try some of the other
> tokenizers/analyzers to get what you want, and also look at the
> admin/analysis page to see what the exact effects are of your
> fieldType definitions.
>
> Here's a great place to start:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> You probably want something like WhitespaceTokenizerFactory
> followed by LowerCaseFilterFactory or some such...
>
> But I really question whether this is what you want either. Do you
> really want a search on "ipad" to *fail* to match input of "#ipad"? Or
> vice-versa?
>
> KeywordTokenizerFactory is probably not the place you want to start:
> the tokenization process doesn't break anything up. You happen to be
> getting separate tokens because of WDDF, which as you see can't
> process things the way you want.
>
>
> Best
> Erick
>
> On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq <shariqn...@gmail.com>
> wrote:
> > I tried tweaking "WordDelimiterFilterFactory", but it won't accept the # or @
> > symbols and ignores them entirely.
> > I need a solution; please suggest one.
> >
> > On 4 August 2011 21:08, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> >
> >> It's the WordDelimiterFilterFactory in your filter chain that's removing the
> >> punctuation entirely from your index, I think.
> >>
> >> Read up on what the WordDelimiter filter does, and what its settings are;
> >> decide how you want things to be tokenized in your index to get the behavior
> >> you want; either get WordDelimiter to do it that way by passing it
> >> different arguments, or stop using WordDelimiter; come back with any
> >> questions after trying that!
> >>
> >>
> >>
> >> On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
> >>
> >>> I have indexed around 1 million tweets (using the "text" dataType).
> >>> When I search the tweets with "#" or "@", I don't get the exact result.
> >>> E.g. when I search for "#ipad" OR "@ipad", I get results where ipad is
> >>> mentioned, skipping the "#" and "@".
> >>> Please suggest how to tune this, or which filter factories to use to get
> >>> the desired result.
> >>> I am indexing the tweet as "text"; below is the "text" definition from my
> >>> schema.xml.
> >>>
> >>>
> >>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
> >>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1"
> >>>             catenateWords="1" catenateNumbers="1"
> >>>             catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.SnowballPorterFilterFactory"
> >>>             protected="protwords.txt" language="English"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
> >>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1"
> >>>             catenateWords="1" catenateNumbers="1"
> >>>             catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.SnowballPorterFilterFactory"
> >>>             protected="protwords.txt" language="English"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>>
> >
> >
> > --
> > Thanks and Regards
> > Mohammad Shariq
> >
>



-- 
Thanks and Regards
Mohammad Shariq
