There are many, many problems with this analyzer chain definition.

This is a summary of the indexing chain:

* WhitespaceTokenizer
* LowerCaseFilter
* SynonymFilter (with ignoreCase=true after lower-casing everything)
* StopFilter (we should have stopped using stopwords 20 years ago)
* WordDelimiterFilter (with all the transformation options set to 0, does 
nothing)
* RemoveDuplicates (this must always be last)
* KStemFilter (good choice)
* EdgeNGramFilter (!!! are you doing prefix matching? doing that with stemming 
makes bizarre matches)
* ReverseStringFilter (Yowza! Only do this on unmodified tokens, what does this 
mean on word stems? Even more bizarre)

Reversed stemmed edge ngrams should cause some really exciting matches. 

Summary of the query chain:

* WhitespaceTokenizer
* LowerCaseFilter
* PorterStemFilter (different stemmer from indexing, guarantees missed matches)
* SynonymFilter (after stemmer? never do this, all tokens need to be stemmed)
* StopFilter (bad, but extra bad after a Porter stemmer that doesn’t generate 
dictionary words)
* WordDelimiterFilter (again, doing nothing, also the results should have been 
stemmed)
* KStemFilter (two stemmers in a chain! never do that! plus the Porter stemmer 
doesn’t produce dictionary words, so KStem won’t do much)

Short version, I’m astonished that this configuration works at all. Delete the 
whole thing, use one from the sample file (without stop words), and reindex. 
There is no way to fix this. Not to be mean, but this is the worst field type 
definition I have ever seen.
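
As a rough sketch of the kind of replacement described above (an assumption for illustration, not copied from the sample file and not tested against this index): one stemmer, the same one at index and query time, synonyms applied before stemming, no stop words, and no ngram or reverse filters. The field type name `text_stemmed` is made up, and `solr.SynonymGraphFilterFactory` assumes a Solr version that ships it.

```xml
<!-- Hypothetical field type name; adjust to your schema. -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- One stemmer only, identical to the query-side stemmer. -->
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Synonyms run before stemming; tokens are already lower-cased,
         so synonyms.txt is assumed to be lower-case and ignoreCase is omitted. -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```

If prefix matching is really needed, it would belong in a separate unstemmed field rather than in this chain. Any change to the index-time analyzer requires a full reindex.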

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 21, 2019, at 4:24 AM, Lavanya Thirumalaisami 
> <lav...@yahoo.co.in.INVALID> wrote:
> 
> 
> Thank you, Aman Deep.
> I tried removing the KStemFilterFactory and still get the same issue, but 
> when I comment out the PorterStemFilterFactory the character y does not get 
> replaced. 
> 
>    On Monday, 21 January 2019, 11:16:23 pm AEDT, Aman deep singh 
> <amandeep.coo...@gmail.com> wrote:  
> 
> Hi Lavanya,
> This is probably due to the KStemFilterFactory; it is removing the y 
> character, since the stemmer has a rule for words ending with y.
> 
> 
> Regards,
> Aman Deep Singh
> 
>> On 21-Jan-2019, at 5:43 PM, Mikhail Khludnev <m...@apache.org> wrote:
>> 
>> querystring is what goes into the QParser, parsedquery is
>> LuceneQuery.toString()
>> 
>> On Mon, Jan 21, 2019 at 3:04 PM Lavanya Thirumalaisami
>> <lav...@yahoo.co.in.invalid> wrote:
>> 
>>> Hi,
>>> Our solr search is not returning expected results for keywords ending with
>>> the character 'y'.
>>> For example keywords like battery, way, accessory etc.
>>> I tried debugging the Solr query in the Solr admin console and I find there is
>>> a difference between the query string and the parsed query.
>>> "querystring":"battery","parsedquery":"batteri",
>>> Also I find that if I search omitting the character y I get all the
>>> results.
>>> This happens only for keywords ending with y; for most others we do not have
>>> this issue.
>>> Could anyone please help me understand why the keywords get changed,
>>> especially the last character? Are there any issues in my field type
>>> definition?
>>> While indexing the data we use the text data type and we have defined as
>>> follows
>>> <fieldType class="solr.TextField" name="ctext" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" />
>>>     <filter catenateAll="1" catenateNumbers="1" catenateWords="1"
>>>             class="solr.WordDelimiterFilterFactory"
>>>             generateNumberParts="0" generateWordParts="0" preserveOriginal="1"
>>>             splitOnCaseChange="0" splitOnNumerics="0" />
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>>>     <filter class="solr.KStemFilterFactory" />
>>>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="255"
>>>             minGramSize="1" />
>>>     <filter class="solr.ReverseStringFilterFactory" />
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>     <filter class="solr.PorterStemFilterFactory" />
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" />
>>>     <filter catenateAll="0" catenateNumbers="0" catenateWords="0"
>>>             class="solr.WordDelimiterFilterFactory"
>>>             generateNumberParts="0" generateWordParts="0" preserveOriginal="1"
>>>             splitOnCaseChange="0" splitOnNumerics="0" />
>>>     <filter class="solr.KStemFilterFactory" />
>>>   </analyzer>
>>> </fieldType>
>>> 
>>> Regards,
>>> Lavanya
>> 
>> 
>> 
>> -- 
>> Sincerely yours
>> Mikhail Khludnev
