On 24.06.14 17:33, Erick Erickson wrote:
> Hmmm. It would help if you posted a couple of other
> pieces of information.... BTW, if this is new code are you
> considering donating it back? If so please open a JIRA so
> we can track it, see: http://wiki.apache.org/solr/HowToContribute
All my other language improvements for the existing Norwegian stemmers
have been donated back to Solr, so yes, if possible. I want to
experiment a little bit before I open a ticket.
But to your question:
> First couple of things I'd do:
> 1> see what the admin/analysis page tells you happens.
It shows correct results for both index and query. The lemmatizer is able
to find the correct stem.
> 2> attach &debug=query to your test case, see what the parsed
> query looks like.
Seems to be OK. Remember that the problem is related to indexing, not
querying. I have double-checked by indexing all the documents with another
stemmer and configuring my lemmatizer only for queries. Then everything
works as it should. Here's the query. As you can see, "studentene" is
stemmed to "student" for two fields (content_no and title_no) which is
correct:
BoostedQuery(boost(+(title_en:studentene^10.0 | host:studentene^30.0 |
content_en:studentene^0.1 | content_no:student^0.1 |
title_no:student^10.0 | anchortext_partial:studentene^70.0 |
subjectcode:studentene^100.0 | canonicalurl:studentene^5.0)~0.2 () () ()
() () (product(int(url_toplevel),const(5)))^20.0
(2.0/(1.0*float(int(url_levels))+1.0))^250.0
(product(float(docrank),const(10000)))^4.0
(1.0/(3.16E-11*float(ms(const(1403686863701),date(last_modified)))+1.0))^50.0
(product(int(url_landingpage),const(3)))^40.0,product(float(urlboost),map(query(language:no,def=0.0),0.0,0.0,1.0))))
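For anyone who wants to check this without reading the parsed query by eye, here is a minimal Python sketch that lists the fields whose term differs from the raw query word. The function names are hypothetical, and only the field:term clauses from the output above are copied into the string.

```python
import re

# Term clauses copied from the parsed query above (boost functions left out).
parsed = (
    "title_en:studentene^10.0 | host:studentene^30.0 | "
    "content_en:studentene^0.1 | content_no:student^0.1 | "
    "title_no:student^10.0 | anchortext_partial:studentene^70.0 | "
    "subjectcode:studentene^100.0 | canonicalurl:studentene^5.0"
)


def terms_by_field(parsed_query):
    """Return {field: term} for every field:term^boost clause."""
    pattern = r"(\w+):(\w+)\^[\d.]+"
    return {m.group(1): m.group(2) for m in re.finditer(pattern, parsed_query)}


def fields_with_changed_term(parsed_query, raw_term):
    """Fields whose analysis chain changed the raw query term."""
    terms = terms_by_field(parsed_query)
    return sorted(field for field, term in terms.items() if term != raw_term)


print(fields_with_changed_term(parsed, "studentene"))
# prints ['content_no', 'title_no']
```

Only the two Norwegian fields come back, which matches what I expect from the query-time analysis.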
> 3> use the admin/schema browser link for the field in question
> to see what actually makes it into the index. (Or use Luke or
> even the TermsComponent).
I haven't played around with this much, but it says "27" for "docs" if I
select the field "content_no". Does this mean that there are only 27
documents in my index with data in this field? Then there is something
really bad going on, because if I change to content_en, this number
grows to 10526 (because another English stemmer is used for that field
instead).
If I change to NorwegianMinimalStemFilter and reindex everything, the
number grows to 28270.
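These counts can also be cross-checked outside the schema browser. The following is a rough Python sketch that only builds the request URLs; the host, port and core name are placeholders (not from my setup), and the second URL assumes a /terms handler backed by the TermsComponent is configured in solrconfig.xml.

```python
from urllib.parse import urlencode

# Assumption: placeholder base URL and core name, adjust to the real installation.
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def docs_with_field_url(field, base=SOLR_CORE_URL):
    """URL of a query matching every document with any value in `field`.

    rows=0 because only the numFound count in the response is of interest.
    """
    params = {"q": f"{field}:[* TO *]", "rows": 0, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"


def top_terms_url(field, limit=20, base=SOLR_CORE_URL):
    """URL asking the TermsComponent for indexed terms of `field`."""
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    return f"{base}/terms?{urlencode(params)}"


for name in ("content_no", "content_en"):
    print(docs_with_field_url(name))
print(top_terms_url("content_no"))
```

Comparing numFound for content_no and content_en should show the same gap as the "docs" figures above, and the terms listing shows what the stemmer actually wrote to the index.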
By writing out debugging info from my stemmer, I just figured out that
only the documents' titles are being stemmed at index time, not the
content itself. So I have found the root of the problem, but I'm not
sure why the field is omitted.
Erlend