Re: Indexing tweet and searching "@keyword" OR "#keyword"

Erick Erickson Sat, 13 Aug 2011 07:52:46 -0700

I don't see an easy way to do that with the standard set of
filters. You'll probably need to write something custom (note,
this is actually pretty easy). I suspect you'll
need to do something like Synonyms, where when you
get a token like #ipod, you essentially make it a synonym
for ipod and insert both in the document...
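
A rough model of that idea, in plain Python rather than as a real Lucene
TokenFilter, might look like the sketch below. Every name in it (index_tokens,
query_tokens, build_index, search, the sample tweets) is hypothetical and only
illustrates the logic: at index time a token such as #ipad is stored together
with the bare word ipad, while at query time nothing is expanded. It assumes
whitespace tokenization plus lowercasing and ignores trailing punctuation.

```python
from collections import defaultdict


def index_tokens(text):
    """Index-time analysis: split on whitespace, lowercase, and for each
    token starting with '#' or '@' also emit the bare word."""
    tokens = []
    for raw in text.split():
        token = raw.lower()
        tokens.append(token)
        if len(token) > 1 and token[0] in "#@":
            tokens.append(token[1:])  # the "synonym" without the prefix
    return tokens


def query_tokens(query):
    """Query-time analysis: split and lowercase only, no expansion."""
    return [raw.lower() for raw in query.split()]


def build_index(docs):
    """Map each indexed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in index_tokens(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching every query token."""
    hits = [index.get(token, set()) for token in query_tokens(query)]
    return set.intersection(*hits) if hits else set()


docs = {
    1: "Loving my new #iPad",
    2: "The ipad battery is great",
    3: "Thanks @ipad for the reply",
}
index = build_index(docs)
print(sorted(search(index, "ipad")))   # [1, 2, 3]
print(sorted(search(index, "#ipad")))  # [1]
print(sorted(search(index, "@ipad")))  # [3]
```

Because the expansion happens only on the indexing side, a query for ipad
finds tweets containing ipad, #ipad or @ipad, while a query for #ipad finds
only tweets that actually contain #ipad. In Solr the same effect would
presumably come from putting the custom filter in the index-time analyzer
only and leaving it out of the query-time analyzer.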


This assumes you can't create a list of all the terms you want
treated this way; if you could, you could just use synonyms.


Best
Erick

On Thu, Aug 11, 2011 at 1:37 AM, Mohammad Shariq <shariqn...@gmail.com> wrote:
> Do you really want a search on "ipad" to *fail* to match input of "#ipad"?
> Or vice-versa?
> My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad',
> BUT for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.
>
>
> On 10 August 2011 19:49, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Please look more carefully at the documentation for WDDF,
>> specifically:
>>
>> split on intra-word delimiters (all non alpha-numeric characters).
>>
>> WordDelimiterFilterFactory will always throw away non alpha-numeric
>> characters; you can't tell it to do otherwise. Try some of the other
>> tokenizers/analyzers to get what you want, and also look at the
>> admin/analysis page to see what the exact effects are of your
>> fieldType definitions.
>>
>> Here's a great place to start:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> You probably want something like WhitespaceTokenizerFactory
>> followed by LowerCaseFilterFactory or some such...
>>
>> But I really question whether this is what you want either. Do you
>> really want a search on "ipad" to *fail* to match input of "#ipad"? Or
>> vice-versa?
>>
>> KeywordTokenizerFactory is probably not the place you want to start:
>> its tokenization doesn't break anything up. You happen to be
>> getting separate tokens because of WDDF, which as you see can't
>> process things the way you want.
>>
>>
>> Best
>> Erick
>>
>> On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq <shariqn...@gmail.com>
>> wrote:
>> > I tried tweaking "WordDelimiterFactory", but it won't accept the # or @
>> > symbols and ignores them completely.
>> > I need a solution; please suggest one.
>> >
>> > On 4 August 2011 21:08, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>> >
>> >> It's the WordDelimiterFactory in your filter chain that's removing the
>> >> punctuation entirely from your index, I think.
>> >>
>> >> Read up on what the WordDelimiter filter does, and what its settings
>> >> are; decide how you want things to be tokenized in your index to get the
>> >> behavior you want; either get WordDelimiter to do it that way by passing
>> >> it different arguments, or stop using WordDelimiter; come back with any
>> >> questions after trying that!
>> >>
>> >>
>> >>
>> >> On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
>> >>
>> >>> I have indexed around 1 million tweets (using the "text" field type).
>> >>> When I search the tweets with "#" or "@" I don't get the exact result.
>> >>> E.g. when I search for "#ipad" OR "@ipad" I get results where ipad
>> >>> is mentioned, skipping the "#" and "@".
>> >>> Please suggest how to tune this, or which filter factories to use to
>> >>> get the desired result.
>> >>> I am indexing the tweets as "text"; below is the "text" definition from
>> >>> my schema.xml.
>> >>>
>> >>>
>> >>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >>>   <analyzer type="index">
>> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>> >>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>> >>>     <filter class="solr.WordDelimiterFilterFactory"
>> >>>             generateWordParts="1" generateNumberParts="1"
>> >>>             catenateWords="1" catenateNumbers="1"
>> >>>             catenateAll="0" splitOnCaseChange="1"/>
>> >>>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>>     <filter class="solr.SnowballPorterFilterFactory"
>> >>>             protected="protwords.txt" language="English"/>
>> >>>   </analyzer>
>> >>>   <analyzer type="query">
>> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>> >>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>> >>>     <filter class="solr.WordDelimiterFilterFactory"
>> >>>             generateWordParts="1" generateNumberParts="1"
>> >>>             catenateWords="1" catenateNumbers="1"
>> >>>             catenateAll="0" splitOnCaseChange="1"/>
>> >>>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>>     <filter class="solr.SnowballPorterFilterFactory"
>> >>>             protected="protwords.txt" language="English"/>
>> >>>   </analyzer>
>> >>> </fieldType>
>> >>>
>> >>>
>> >
>> >
>> > --
>> > Thanks and Regards
>> > Mohammad Shariq
>> >
>>
>
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>
