I am working on indexing Arabic documents containing Arabic diacritics and
dotless characters (old Arabic characters). I am using Apache Tomcat as the
server, and my own modified version of the AraMorph analyzer as the Arabic
analyzer. On the development environment I managed to normalize the Arabic
diacritics and dotless characters (the same concept as in
solr.ArabicNormalizationFilterFactory), and I can verify that the analyzer
works fine and that I get the correct stem for Arabic words. The input text
file used for testing is UTF-8 encoded.
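To illustrate the normalization concept I mean (this is only a rough sketch in plain Java, not the actual AraMorph or Solr filter code): stripping the Arabic harakat/tanwin range U+064B..U+065F plus the superscript alef U+0670, which is the same idea the ArabicNormalizationFilterFactory applies.

```java
import java.util.regex.Pattern;

// Minimal sketch of the normalization idea (not the AraMorph implementation):
// remove Arabic diacritics (U+064B..U+065F) and the superscript alef (U+0670).
public class ArabicNormalize {
    private static final Pattern DIACRITICS =
            Pattern.compile("[\\u064B-\\u065F\\u0670]");

    public static String normalize(String input) {
        return DIACRITICS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // "كِتَاب" (with kasra and fatha) normalizes to "كتاب"
        System.out.println(normalize("كِتَاب"));
    }
}
```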
When I build the AraMorph jar file and place it under the Solr lib directory,
the diacritics and the dotless characters split the word. I made sure that
server.xml contains URIEncoding="UTF-8" on the Connector.
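For reference, the Connector element in my server.xml looks roughly like this (ports are the Tomcat defaults; adjust to your setup):

```xml
<!-- Tomcat server.xml: HTTP connector with the URI encoding set -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```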
I also made sure that the text sent to Solr via SolrJ is UTF-8 encoded, for
example:
solr.addBean(new Doc("4", new String("حِباًَ".getBytes("UTF-8"), "UTF-8")));
but nothing works.
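Note that the getBytes / new String round trip is a no-op when both sides use UTF-8, so it cannot change how the text reaches Solr either way. A quick standalone check (plain Java, using StandardCharsets to avoid the checked exception):

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String s = "حِباً";
        // Encoding to UTF-8 and decoding back yields an equal string,
        // so this round trip neither fixes nor causes an encoding problem.
        String roundTripped = new String(s.getBytes(StandardCharsets.UTF_8),
                                         StandardCharsets.UTF_8);
        System.out.println(s.equals(roundTripped)); // prints "true"
    }
}
```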
I tried the analysis page in the Solr admin UI for both indexing and querying,
and both show that the Arabic word is split whenever a diacritic or a dotless
character is found.
Do you have any idea what the problem might be?
schema snippet:
I also added the following parameter to the JVM: -Dfile.encoding=UTF-8
Thanks,
engy