To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8
Thanks for all the responses.
CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the solr index is the
only place it exists (to us). I do have a java app that "reindexes",
i.e., reads all documents out of one index, does some transformations,
and writes the documents into a new index.
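A minimal sketch of that read-everything loop, assuming Lucene 3.x-era
APIs (IndexReader.open, isDeleted); the ReindexScan class name and the
per-field hook are placeholders, not part of the original app:

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Fieldable;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class ReindexScan {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
            try {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (reader.isDeleted(i)) continue;   // skip deleted docs
                    Document doc = reader.document(i);   // stored fields only
                    for (Fieldable f : doc.getFields()) {
                        String value = f.stringValue();
                        if (value != null) {
                            // inspect/transform the value here, then add it
                            // to a Document bound for the new index
                        }
                    }
                }
            } finally {
                reader.close();
            }
        }
    }

Note that only stored fields can be read back this way; indexed-but-not-
stored content is not recoverable from the index.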
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind wrote:
>
> There are various packages of such heuristic algorithms to guess char
> encoding; I wouldn't try to write my own. icu4j might include such an
> algorithm, not sure.
>
It does:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
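A minimal usage sketch of that class (the sample bytes are illustrative;
short or ASCII-only input matches many charsets, so treat the result as
a ranked guess rather than a proof):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class DetectExample {
        public static void main(String[] args) throws Exception {
            byte[] raw = "caf\u00e9".getBytes("ISO-8859-1"); // ends in 0xE9, not valid UTF-8

            CharsetDetector detector = new CharsetDetector();
            detector.setText(raw);
            CharsetMatch match = detector.detect();          // best guess, may be null
            if (match != null) {
                System.out.println(match.getName()
                    + " (confidence " + match.getConfidence() + "/100)");
            }
        }
    }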
The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF-16), so
the first place where invalid UTF-8 is detected/corrected/etc. is
during your analysis process, which takes your raw content and
produces char[] based tokens.
Second, during indexing, Lucene ensures that the incoming char[]
tokens are valid UTF-16; invalid sequences such as unpaired surrogates
are replaced with the replacement character U+FFFD.
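If you want to catch bad bytes before they ever become char[] tokens,
the byte-to-char decode is the natural chokepoint. A sketch using the
JDK's CharsetDecoder (StandardCharsets needs Java 7+; StrictUtf8 is a
made-up name):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public final class StrictUtf8 {
        // Decode raw bytes to the char[] form the analyzer will see,
        // failing loudly instead of silently substituting U+FFFD.
        public static char[] decodeOrThrow(byte[] raw) throws CharacterCodingException {
            CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));  // throws on invalid UTF-8
            char[] out = new char[chars.remaining()];
            chars.get(out);
            return out;
        }
    }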
Scanning for only 'valid' UTF-8 is definitely not simple. You can
eliminate some obviously invalid UTF-8 by byte ranges, but you can't
confirm valid UTF-8 by byte ranges alone. There are some bytes that are
only valid UTF-8 when they come directly before or after certain other
bytes. There is no stateless byte-at-a-time test; you have to track the
multi-byte sequences as you go (see the sketch after this message).
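To make the "certain bytes only next to certain bytes" point concrete,
here is a deliberately simplified check (it ignores overlong encodings
and the surrogate range, so it is a sketch, not a complete validator;
Utf8Check is a made-up name):

    public final class Utf8Check {
        public static boolean looksLikeUtf8(byte[] bytes) {
            int expect = 0;                      // continuation bytes still owed
            for (byte b : bytes) {
                int v = b & 0xFF;
                if (expect > 0) {
                    if (v < 0x80 || v > 0xBF) return false;  // not a continuation byte
                    expect--;
                } else if (v < 0x80) {
                    // single-byte ASCII, always fine
                } else if (v >= 0xC2 && v <= 0xDF) {
                    expect = 1;                  // lead byte of a 2-byte sequence
                } else if (v >= 0xE0 && v <= 0xEF) {
                    expect = 2;                  // lead byte of a 3-byte sequence
                } else if (v >= 0xF0 && v <= 0xF4) {
                    expect = 3;                  // lead byte of a 4-byte sequence
                } else {
                    return false;                // 0x80-0xC1, 0xF5-0xFF never start a character
                }
            }
            return expect == 0;                  // must not end mid-sequence
        }
    }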
Take a look also at icu4j, which is one of the contrib projects ...
> Converting on the fly is not supported by Solr but should be relatively
> easy in Java.
> Also, scanning is relatively simple (accept only a range). Detection too:
> http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. [...]
Converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also, scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html
> We've created an index from a number of different documents that are
> supplied by third parties. [...]
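The on-the-fly conversion above boils down to decode-then-re-encode. A
sketch (windows-1252 here is only an example source charset; in practice
the name would come from a detector like the ones linked):

    public class Transcode {
        // Decode bytes with the charset they are actually in,
        // then re-encode the text as UTF-8.
        public static byte[] toUtf8(byte[] raw, String sourceCharset)
                throws java.io.UnsupportedEncodingException {
            String text = new String(raw, sourceCharset);  // bytes -> chars
            return text.getBytes("UTF-8");                 // chars -> UTF-8 bytes
        }

        public static void main(String[] args) throws Exception {
            byte[] cp1252 = { (byte) 0xE9 };               // 'é' in windows-1252
            byte[] utf8 = toUtf8(cp1252, "windows-1252");  // -> 0xC3 0xA9
            System.out.println(utf8.length);               // prints 2
        }
    }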
This is supposed to be dealt with outside the index. All input must be UTF-8
encoded. Failing to do so will give unexpected results.
> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple of questions about this: [...]
We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple of questions about this:
1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the index will end up containing only
UTF-8 encoded text?