Thanks for your reply, Sergey!

Well, I was a bit puzzled. I had already tried adding a line to set the character set, but Solr complained about that as well. I then installed the Russian dictionary, and Solr was happy to load it. I noticed that for Russian the character set is declared only in the affix file. So when I added the line 'SET UTF-8' to the affix file for en_GB only, all was well. I must have added that same line to the .dic file as well before, and I suppose that was what Solr was complaining about.
I just checked that, and that seems to be the case. The character set should only be set on the first line of the .aff file; the .dic file should be left alone.

Thanks again Sergey, that was very useful.

Best,

- Rob

On Wed, Nov 14, 2012 at 11:08 AM, Сергей Бирюков <kapac...@yandex.ru> wrote:

> Rob, as regards your "problem"
>
>> 'SET charset'
>
> the word 'charset' must be replaced with the name of a character set
> (i.e. an encoding). For example, you can write 'SET UTF-8'.
>
> BUT...
>
> ----
>
> Be careful!
> At least for Russian morphology, HunspellStemFilterFactory has
> bug(s) in its algorithm.
>
> A simple comparison with the original hunspell library showed a huge
> difference.
>
> For example, on Linux x86_64 Ubuntu 12.10:
>
> 1) INSTALL:
> # sudo apt-get install hunspell hunspell-ru
>
> 2) TEST with the string "мама мыла раму мелом"
> (it means: "mom was_washing frame (with) chalk"):
>
> 2.1) OS hunspell library
> # echo "мама мыла раму мелом" | hunspell -d ru_RU -D -m
> gives results:
> ...
> LOADED DICTIONARY:
> /usr/share/hunspell/ru_RU.aff
> /usr/share/hunspell/ru_RU.dic
>
> мама  -> мама
> мыла  -> мыло | мыть   <<< as noun | as verb
> раму  -> рама
> мелом -> мел
>
> 2.2) Solr's HunspellStemFilterFactory
> config fieldType
> <fieldType name="text_hunspell" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.HunspellStemFilterFactory"
>             dictionary="ru_RU.dic" affix="ru_RU.aff" ignoreCase="true" />
>   </analyzer>
> </fieldType>
>
> gives results:
> мама  -> мама | мама : FAILED: duplicate words
> мыла  -> мыть | мыло : SUCCESS: all OK
> раму  -> рама | расти : FAILED: second word is wrong and superfluous
> мелом -> мести | метить | месть | мел : FAILED: only the last word is
>          correct, the others are superfluous
>
> ----------
>
> That's why I have been using a JNA (v3.2.7) binding to the original
> (system) libhunspell.so for a long time :)
>
> ----
> Best regards
> Sergey Biryukov
> Moscow, Russian Federation
>
> 14.11.2012 04:18, Rob Koeling wrote:
>
>> If so, would you be willing to share the .dic and .aff files with me?
>> When I try to load a dictionary file, Solr is complaining that:
>>
>> java.lang.RuntimeException: java.io.IOException: Unable to load hunspell
>> data! [dictionary=en_GB.dic,affix=en_GB.aff]
>>         at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:116)
>> .......
>> Caused by: java.text.ParseException: The first non-comment line in the
>> affix file must be a 'SET charset', was: 'FLAG num'
>>         at org.apache.lucene.analysis.hunspell.HunspellDictionary.getDictionaryEncoding(HunspellDictionary.java:306)
>>         at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:130)
>>         at org.apache.lucene.analysis.hunspell.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:103)
>> ... 46 more
>>
>> When I change the first line to 'SET charset' it is still not happy. I got
>> the dictionary files from the OpenOffice website.
>>
>> I'm using Solr 4.0 (but had the same problem with 3.6)
>>
>> - Rob
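
The rule this thread settles on (the first non-comment line of the .aff file must be 'SET <encoding>', and the .dic file gets no such line) can be checked before restarting Solr. The following is only a rough Python sketch, not part of Solr or Lucene. The function names are made up for illustration. Treating blank lines and lines starting with '#' as skippable is an assumption based on the wording of the error message, and Python's codec registry is used as a stand-in for the encodings Java accepts, so the two may disagree on less common names.

```python
import codecs
from typing import Iterable, Optional


def first_directive(lines: Iterable[str]) -> Optional[str]:
    """Return the first line that is neither blank nor a '#' comment (assumed rule)."""
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith("#"):
            return line
    return None


def declared_charset(aff_lines: Iterable[str]) -> Optional[str]:
    """Return the name given by a leading 'SET <name>' line, or None if absent."""
    line = first_directive(aff_lines)
    if line is None:
        return None
    parts = line.split()
    if len(parts) == 2 and parts[0] == "SET":
        return parts[1]
    return None


def is_known_encoding(name: str) -> bool:
    """Check the name against Python's codecs (a stand-in for Java's charsets)."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def dic_has_set_line(dic_lines: Iterable[str]) -> bool:
    """Report whether a .dic file contains a stray 'SET ...' line."""
    return any(line.strip().startswith("SET ") for line in dic_lines)


if __name__ == "__main__":
    original_aff = ["# affix file as downloaded", "FLAG num"]
    literal_aff = ["SET charset", "FLAG num"]
    fixed_aff = ["SET UTF-8", "FLAG num"]

    print(declared_charset(original_aff))           # None: no SET line first
    print(is_known_encoding(declared_charset(literal_aff)))  # False: placeholder name
    print(is_known_encoding(declared_charset(fixed_aff)))    # True
```

To run it on real files, pass the opened file objects (for example `open("en_GB.aff", encoding="latin-1")`) in place of the sample lists; reading as latin-1 is just a safe way to inspect the first lines before the real encoding is known.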