Also take a look at ICU4J, which is one of the contrib projects ...

> Converting on the fly is not supported by Solr but should be relatively
> easy in Java.
> Scanning is also relatively simple (accept only a range). Detection too:
> http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. We want the index to only contain UTF-8
>> encoded characters. I have a couple of questions about this:
>>
>> 1) Is there any way to be sure during indexing (by setting something
>> in the Solr configuration?) that the documents that we index will
>> always be stored in UTF-8? Can Solr convert documents that need
>> converting on the fly, or can Solr reject documents containing illegal
>> characters?
>>
>> 2) Is there a way to scan the existing index to find any string
>> containing non-UTF-8 characters? Or is there another way that I can
>> discover if any crept into my index?
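
For the "reject or convert before indexing" part, here is a minimal sketch using only the JDK, run on the raw bytes before they are sent to Solr. The class name Utf8Check and both method names are made up for illustration and are not part of Solr or ICU4J. It assumes you already know (or have guessed, e.g. with a detector such as chardet or ICU4J) the source encoding when converting.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Solr or ICU4J class.
final class Utf8Check {

    // Returns true only if the bytes are well-formed UTF-8.
    // A strict decoder (REPORT) throws instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Re-encodes bytes from an assumed source charset to UTF-8.
    // Assumption: the caller supplies the right source charset; bytes that are
    // malformed in that charset are replaced by the JDK, not rejected.
    static byte[] toUtf8(byte[] bytes, Charset assumedSource) {
        return new String(bytes, assumedSource).getBytes(StandardCharsets.UTF_8);
    }
}
```

A document whose bytes fail isValidUtf8 can then be rejected, or passed through toUtf8 with the detected charset before indexing.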
--
http://jetwick.com open twitter search