Re: TokenFilter not working at index time

Erlend Garåsen Wed, 25 Jun 2014 06:54:34 -0700

On 24.06.14 17:33, Erick Erickson wrote:

Hmmm. It would help if you posted a couple of other
pieces of information.... BTW, if this is new code are you
considering donating it back? If so please open a JIRA so
we can track it, see: http://wiki.apache.org/solr/HowToContribute

All my other language improvements for the existing Norwegian stemmers have been donated back to Solr, so yes, if possible. I want to experiment a little bit before I open a ticket.

But to your question:
First couple of things I'd do:
1> see what the admin/analysis page tells you happens.

Shows correct results for index and query. The lemmatizer is able to find the correct stem.

2> attach &debug=query to your test case, see what the parsed
     query looks like.

Seems to be OK. Remember that the problem is related to indexing, not querying. I have double-checked by indexing all the documents with another stemmer and configuring my lemmatizer only for queries. Then everything works as it should. Here's the query. As you can see, "studentene" is stemmed to "student" for two fields (content_no and title_no), which is correct:

BoostedQuery(boost(+(title_en:studentene^10.0 | host:studentene^30.0 | content_en:studentene^0.1 | content_no:student^0.1 | title_no:student^10.0 | anchortext_partial:studentene^70.0 | subjectcode:studentene^100.0 | canonicalurl:studentene^5.0)~0.2 () () () () () (product(int(url_toplevel),const(5)))^20.0(2.0/(1.0*float(int(url_levels))+1.0))^250.0(product(float(docrank),const(10000)))^4.0(1.0/(3.16E-11*float(ms(const(1403686863701),date(last_modified)))+1.0))^50.0(product(int(url_landingpage),const(3)))^40.0,product(float(urlboost),map(query(language:no,def=0.0),0.0,0.0,1.0))))

3> use the admin/schema browser link for the field in question
    to see what actually makes it into the index. (Or use Luke or
    even the TermsComponent).

I haven't played around much with this, but it says "27" for "docs" if I select the field "content_no". Does this mean that there are only 27 documents in my index with data in this field? Then there is something really bad going on, because if I change to content_en, this number grows to 10526 (because an English stemmer is used for that field instead).

If I change to NorwegianMinimalStemFilter and reindex everything, the number grows to 28270.
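
The "docs" count from the schema browser can be cross-checked outside the admin UI. Below is a minimal Python sketch that only builds the two request URLs: a range query that counts documents with any indexed value in a field, and a TermsComponent request that lists the terms that actually made it into the index. The host, port, and core name (localhost:8983, collection1) are assumptions and not taken from this thread, and the /terms handler is assumed to be enabled in solrconfig.xml.

```python
from urllib.parse import urlencode

# Assumed location of the Solr core; adjust to your own installation.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def docs_with_field_url(field, base=SOLR_BASE):
    """Build a /select URL that counts documents having any indexed value in `field`."""
    params = {
        "q": f"{field}:[* TO *]",  # open-ended range: matches any indexed value
        "rows": 0,                 # only the numFound count is of interest
        "wt": "json",
    }
    return f"{base}/select?{urlencode(params)}"


def top_terms_url(field, limit=20, base=SOLR_BASE):
    """Build a /terms (TermsComponent) URL listing indexed terms for `field`."""
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    return f"{base}/terms?{urlencode(params)}"


if __name__ == "__main__":
    print(docs_with_field_url("content_no"))
    print(top_terms_url("content_no"))
```

Fetching the first URL and reading numFound from the response should give a figure comparable to the schema browser's "docs" value. Running both URLs for content_no and content_en would show whether the Norwegian field is really missing terms for most documents.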

By writing out debugging info from my stemmer, I just figured out that only the documents' titles are being stemmed at index time, not the content itself. So I have found the root of the problem, but I'm not sure why the field is omitted.


Erlend
