> Do you really want a search on "ipad" to *fail* to match input of "#ipad"? Or vice-versa?

My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad', BUT for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.
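
To make the requirement concrete, here is a small Python sketch of the matching rule. It is not Solr code: index_tokens, query_tokens and matches are hypothetical helpers used only for illustration. The sketch assumes the index side stores both the token as written and the bare word, while the query side keeps a leading '#' or '@'.

```python
# Hypothetical helpers, not Solr APIs: they only model the desired matching rule.
PREFIXES = "#@"


def index_tokens(text):
    """Index-time analysis: keep each token as written and, for tokens
    starting with '#' or '@', also emit the bare word."""
    tokens = set()
    for raw in text.lower().split():
        tokens.add(raw)
        bare = raw.lstrip(PREFIXES)
        if bare:
            tokens.add(bare)
    return tokens


def query_tokens(q):
    """Query-time analysis: lowercase and split on whitespace only,
    so a leading '#' or '@' is preserved."""
    return q.lower().split()


def matches(q, doc):
    """A document matches when every query token is present in its index tokens."""
    indexed = index_tokens(doc)
    return all(tok in indexed for tok in query_tokens(q))
```

With this rule, matches("ipad", ...) succeeds for tweets containing either 'ipad' or '#ipad', while matches("#ipad", ...) succeeds only for tweets that contain '#ipad' itself.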
On 10 August 2011 19:49, Erick Erickson <erickerick...@gmail.com> wrote:

> Please look more carefully at the documentation for WDDF,
> specifically:
>
> split on intra-word delimiters (all non alpha-numeric characters).
>
> WordDelimiterFilterFactory will always throw away non alpha-numeric
> characters, you can't tell it do to otherwise. Try some of the other
> tokenizers/analyzers to get what you want, and also look at the
> admin/analysis page to see what the exact effects are of your
> fieldType definitions.
>
> Here's a great place to start:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> You probably want something like WhitespaceTokenizerFactory
> followed by LowerCaseFilterFactory or some such...
>
> But I really question whether this is what you want either. Do you
> really want a search on "ipad" to *fail* to match input of "#ipad"? Or
> vice-versa?
>
> KeywordTokenizerFactory is probably not the place you want to start,
> the tokenization process doesn't break anything up, you happen to be
> getting separate tokens because of WDDF, which as you see can't
> process things the way you want.
>
>
> Best
> Erick
>
> On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq <shariqn...@gmail.com> wrote:
> > I tried tweaking "WordDelimiterFactory" but I won't accept # OR @ symbols
> > and it ignored totally.
> > I need solution plz suggest.
> >
> > On 4 August 2011 21:08, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> >
> >> It's the WordDelimiterFactory in your filter chain that's removing the
> >> punctuation entirely from your index, I think.
> >>
> >> Read up on what the WordDelimiter filter does, and what it's settings are;
> >> decide how you want things to be tokenized in your index to get the behavior
> >> your want; either get WordDelimiter to do it that way by passing it
> >> different arguments, or stop using WordDelimiter; come back with any
> >> questions after trying that!
> >>
> >>
> >>
> >> On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
> >>
> >>> I have indexed around 1 million tweets ( using "text" dataType).
> >>> when I search the tweet with "#" OR "@" I dont get the exact result.
> >>> e.g. when I search for "#ipad" OR "@ipad" I get the result where ipad is
> >>> mentioned skipping the "#" and "@".
> >>> please suggest me, how to tune or what are filterFactories to use to get the
> >>> desired result.
> >>> I am indexing the tweet as "text", below is "text" which is there in my
> >>> schema.xml.
> >>>
> >>>
> >>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>> <analyzer type="index">
> >>> <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>> <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
> >>> minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
> >>> <filter class="solr.WordDelimiterFilterFactory"
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>> <filter class="solr.LowerCaseFilterFactory"/>
> >>> <filter class="solr.SnowballPorterFilterFactory"
> >>> protected="protwords.txt" language="English"/>
> >>> </analyzer>
> >>> <analyzer type="query">
> >>> <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>> <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
> >>> minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
> >>> <filter class="solr.WordDelimiterFilterFactory"
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>> <filter class="solr.LowerCaseFilterFactory"/>
> >>> <filter class="solr.SnowballPorterFilterFactory"
> >>> protected="protwords.txt" language="English"/>
> >>> </analyzer>
> >>> </fieldType>
> >>>
> >>>
> >
> >
> > --
> > Thanks and Regards
> > Mohammad Shariq
>

--
Thanks and Regards
Mohammad Shariq