Encoding problem while indexing

2011-06-29 Thread Engy Morsy
I am working on indexing Arabic documents containing Arabic diacritics and 
dotless characters (old Arabic characters). I am using the Apache Tomcat 
server and my modified version of the aramorph analyzer as the Arabic 
analyzer. In the development environment I managed to normalize the Arabic 
diacritics and dotless characters (same concept as in 
solr.ArabicNormalizationFilterFactory), and I can verify that the analyzer 
works fine and that I get the correct stem for Arabic words. The input text 
file for testing is UTF-8 encoded.
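
For readers unfamiliar with this kind of normalization, here is a minimal sketch of the diacritic-stripping step. It is not the modified aramorph code (which is not shown in this post) and is not copied from Solr's filter; the class name and the character range (tatweel U+0640 plus the marks U+064B to U+0652) are assumptions chosen for illustration, and dotless-character mapping is left out.

```java
// Minimal sketch: strip Arabic diacritics (harakat) and tatweel before stemming.
// DiacriticStripper and its character range are illustrative assumptions.
class DiacriticStripper {

    // True for tatweel (U+0640) and the diacritic marks U+064B..U+0652.
    static boolean isStrippable(char c) {
        return c == '\u0640' || (c >= '\u064B' && c <= '\u0652');
    }

    // Returns a copy of the input with all strippable characters removed.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!isStrippable(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // HAH, KASRA, BEH, ALEF, FATHATAN, FATHA, written as escapes so the
        // encoding of the source file cannot change the test input.
        String word = "\u062D\u0650\u0628\u0627\u064B\u064E";
        System.out.println(strip(word).length()); // prints 3
    }
}
```

Writing the test word with \u escapes keeps the check independent of how the source file or console is encoded, which is useful when the suspected problem is an encoding one.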

When I build the aramorph jar file and place it under the Solr lib directory, 
the diacritics and the dotless characters split the word. I made sure that 
server.xml contains URIEncoding="UTF-8".

I also made sure that the text being sent to Solr using SolrJ is UTF-8 encoded.
Example: solr.addBean(new Doc("4", new String("حِباًَ".getBytes("UTF8"))));

But nothing works.
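
One detail of the SolrJ line above may be worth checking, although it is not necessarily the cause here. A Java String is already Unicode, and new String(bytes) without a charset argument decodes with the platform default charset, which can differ between a development machine and a server. The sketch below (class and method names are hypothetical) shows that the round trip only preserves the text when the decoding charset matches the encoding one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Minimal sketch: encode as UTF-8, then decode with a named charset.
// CharsetRoundTrip is an illustrative name, not part of SolrJ.
class CharsetRoundTrip {

    static String roundTrip(String s, Charset decodeWith) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Naming the charset avoids depending on the platform default.
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String word = "\u062D\u0650\u0628\u0627\u064B\u064E";
        // Matching charsets: the text survives unchanged.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word));      // true
        // Mismatched charsets: every UTF-8 byte becomes its own character.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1).equals(word)); // false
    }
}
```

If the string is already correct in memory, passing it to the bean directly, without any getBytes/new String round trip, avoids the question entirely.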

I tried the analysis page in the Solr admin for both indexing and querying, 
and both show that the Arabic word is split wherever a diacritic or dotless 
character is found.
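
One way such a split can arise, offered as a possibility rather than a diagnosis: Arabic diacritics are Unicode combining marks, so Character.isLetter returns false for them, and any tokenizer that ends a token at the first non-letter will cut the word there. The sketch below uses a hypothetical LetterRunTokenizer (not a Solr or aramorph class) to compare letter-only tokenization with a variant that keeps combining marks inside the word.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of letter-run tokenization; an illustrative class only.
class LetterRunTokenizer {

    // Combining marks (Unicode categories Mn, Mc, Me) are not letters.
    static boolean isMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Splits text into runs of letters. When keepMarks is true, combining
    // marks count as part of the word instead of ending the token.
    static List<String> tokenize(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c) || (keepMarks && isMark(c))) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // HAH, KASRA, BEH, ALEF, FATHATAN, FATHA
        String word = "\u062D\u0650\u0628\u0627\u064B\u064E";
        System.out.println(tokenize(word, false).size()); // prints 2
        System.out.println(tokenize(word, true).size());  // prints 1
    }
}
```

If the deployed analysis chain tokenizes before the diacritics are normalized away, the letter-only behaviour above would produce exactly the kind of split described.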

Do you have any idea what the problem might be?


schema snippet:






I also added the following parameter to the JVM: -Dfile.encoding=UTF-8
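
To confirm that the JVM option actually took effect inside the server, and to see exactly which characters reach the analyzer, a small check like the following can be run or logged from within the deployed code. EncodingCheck and codePoints are hypothetical names used only for this sketch.

```java
import java.nio.charset.Charset;

// Minimal diagnostic sketch; EncodingCheck is an illustrative name.
class EncodingCheck {

    // Renders each code point as U+XXXX so text can be compared code point
    // by code point between environments, independent of fonts or consoles.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // What the running JVM really uses, regardless of what was intended.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println(codePoints("\u062D\u0650")); // prints U+062D U+0650
    }
}
```

If the code points logged on the server differ from those seen in the development environment, the text is being altered before analysis; if they match, the difference is more likely in the analysis chain itself.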

Thanks,
engy


Question regarding solr workflow

2011-07-04 Thread Engy Morsy
Hi,

What is the workflow of Solr, starting from submitting an XML document to be 
indexed? Is there any default analyzer that is called before the analyzer 
specified in my Solr schema for the text field? I have a situation where the 
words of the text field to be analyzed are somehow split.

For example, if I have the text field value "ABC DEF", I can get it as "AB C D EF".

Thanks
engy


Increase String length

2011-07-10 Thread Engy Morsy
Dear all,

In schema.xml I had the following fieldType definition:



The length of the string value I am indexing exceeds the default length (256). 
How do I override the default length in my schema?


Thanks and best regards,
Engy Morsy

Project Manager
ICT Department
Bibliotheca Alexandrina
P.O.Box 138, Chatby
Alexandria 21526, Egypt
Tel: +20-3-483 Ext:1423
Fax: +20-3-4820405
E-mail: engy.mo...@bibalex.org
Websites: http://bibalex.org
http://dar.bibalex.org