I don't see an easy way to do that with the standard set of filters. You'll probably need to write something custom (note, this is actually pretty easy). I suspect you'll need to do something like Synonyms, where when you get a token like #ipod, you essentially make it a synonym for ipod and insert both in the document...
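A minimal sketch of that idea, written in plain Python rather than as a real Lucene TokenFilter. The function names (`index_tokens`, `query_tokens`, `matches`) are hypothetical and are not part of Solr. It assumes whitespace tokenization plus lowercasing, and that the expansion happens at index time only, so a query for ipad finds both forms while a query for #ipad finds only the tagged form.

```python
# Sketch only: a plain-Python stand-in for a custom index-time token filter.
# All names here are hypothetical; none of this is Solr or Lucene API.

PREFIXES = ("#", "@")


def index_tokens(text):
    """Return (position, term) pairs for indexing.

    A token such as '#ipod' is kept as-is, and its bare form 'ipod' is
    added at the same position, the way a synonym would be.
    """
    out = []
    for pos, raw in enumerate(text.lower().split()):
        out.append((pos, raw))
        if raw.startswith(PREFIXES) and len(raw) > 1:
            out.append((pos, raw[1:]))
    return out


def query_tokens(text):
    """Query side: lowercase and split only, with no expansion."""
    return text.lower().split()


def matches(doc, query):
    """True if every query term appears among the document's indexed terms."""
    indexed = {term for _, term in index_tokens(doc)}
    return all(term in indexed for term in query_tokens(query))
```

With this, `matches("Love my #iPad", "ipad")` and `matches("Love my #iPad", "#ipad")` both hold, while `matches("Love my iPad", "#ipad")` does not. In Solr the same effect would need separate index and query analyzers, with the custom filter present only in the index chain.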
This assumes you can't create a list of all the terms you want treated
this way, because if you could, you could just use synonyms.

Best
Erick

On Thu, Aug 11, 2011 at 1:37 AM, Mohammad Shariq <shariqn...@gmail.com> wrote:

> Do you really want a search on "ipad" to *fail* to match input of "#ipad"?
> Or vice-versa?
> My requirement is: I want to search both '#ipad' and 'ipad' for q='ipad',
> BUT for q='#ipad' I want to search ONLY '#ipad', excluding 'ipad'.
>
>
> On 10 August 2011 19:49, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Please look more carefully at the documentation for WDDF,
>> specifically:
>>
>> split on intra-word delimiters (all non alpha-numeric characters).
>>
>> WordDelimiterFilterFactory will always throw away non alpha-numeric
>> characters, you can't tell it to do otherwise. Try some of the other
>> tokenizers/analyzers to get what you want, and also look at the
>> admin/analysis page to see what the exact effects are of your
>> fieldType definitions.
>>
>> Here's a great place to start:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> You probably want something like WhitespaceTokenizerFactory
>> followed by LowerCaseFilterFactory or some such...
>>
>> But I really question whether this is what you want either. Do you
>> really want a search on "ipad" to *fail* to match input of "#ipad"? Or
>> vice-versa?
>>
>> KeywordTokenizerFactory is probably not the place you want to start,
>> the tokenization process doesn't break anything up, you happen to be
>> getting separate tokens because of WDDF, which as you see can't
>> process things the way you want.
>>
>>
>> Best
>> Erick
>>
>> On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq <shariqn...@gmail.com>
>> wrote:
>> > I tried tweaking "WordDelimiterFactory" but it won't accept # OR @ symbols
>> > and it ignored them totally.
>> > I need a solution, plz suggest.
>> >
>> > On 4 August 2011 21:08, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>> >
>> >> It's the WordDelimiterFactory in your filter chain that's removing the
>> >> punctuation entirely from your index, I think.
>> >>
>> >> Read up on what the WordDelimiter filter does, and what its settings are;
>> >> decide how you want things to be tokenized in your index to get the
>> >> behavior you want; either get WordDelimiter to do it that way by passing it
>> >> different arguments, or stop using WordDelimiter; come back with any
>> >> questions after trying that!
>> >>
>> >>
>> >>
>> >> On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
>> >>
>> >>> I have indexed around 1 million tweets (using "text" dataType).
>> >>> When I search the tweets with "#" OR "@" I don't get the exact result.
>> >>> e.g. when I search for "#ipad" OR "@ipad" I get the results where ipad
>> >>> is mentioned, skipping the "#" and "@".
>> >>> Please suggest how to tune this, or which filterFactories to use to get
>> >>> the desired result.
>> >>> I am indexing the tweet as "text"; below is the "text" fieldType from my
>> >>> schema.xml.
>> >>>
>> >>>
>> >>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >>>   <analyzer type="index">
>> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>> >>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>> >>>     <filter class="solr.WordDelimiterFilterFactory"
>> >>>             generateWordParts="1" generateNumberParts="1"
>> >>>             catenateWords="1" catenateNumbers="1"
>> >>>             catenateAll="0" splitOnCaseChange="1"/>
>> >>>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>>     <filter class="solr.SnowballPorterFilterFactory"
>> >>>             protected="protwords.txt" language="English"/>
>> >>>   </analyzer>
>> >>>   <analyzer type="query">
>> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>>     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
>> >>>             minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
>> >>>     <filter class="solr.WordDelimiterFilterFactory"
>> >>>             generateWordParts="1" generateNumberParts="1"
>> >>>             catenateWords="1" catenateNumbers="1"
>> >>>             catenateAll="0" splitOnCaseChange="1"/>
>> >>>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>>     <filter class="solr.SnowballPorterFilterFactory"
>> >>>             protected="protwords.txt" language="English"/>
>> >>>   </analyzer>
>> >>> </fieldType>
>> >>>
>> >>>
>> >
>> >
>> > --
>> > Thanks and Regards
>> > Mohammad Shariq
>> >
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>