On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> There are various packages of such heuristic algorithms to guess char
> encoding, I wouldn't try to write my own. icu4j might include such an
> algorithm, not sure.
>
It does: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html

This takes a sample of the file and makes a guess.

Also, in general, keep in mind that Java CharsetDecoders tend to silently replace or skip illegal chars rather than throw exceptions. If you want to be "paranoid" about these things, instead of opening the InputStreamReader with a Charset, open it with something like:

  charset.newDecoder()
         .onMalformedInput(CodingErrorAction.REPORT)
         .onUnmappableCharacter(CodingErrorAction.REPORT)

Then if the decoder ends up in some illegal state/byte sequence, instead of silently replacing it with U+FFFD, it will throw an exception.

Of course, as Jonathan says, you cannot "confirm" that something is UTF-8. But many times you can "confirm" it's definitely not: see https://issues.apache.org/jira/browse/SOLR-2003 for a practical use of this; we throw an exception if we can detect that your stopwords or synonyms file is definitely wrongly encoded.
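For reference, here is a minimal sketch of the CharsetDetector usage (not from the original thread; it assumes icu4j is on the classpath and that args[0] is the file to inspect):

  import com.ibm.icu.text.CharsetDetector;
  import com.ibm.icu.text.CharsetMatch;

  import java.io.BufferedInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  public class GuessCharset {
    public static void main(String[] args) throws IOException {
      // setText(InputStream) needs mark()/reset() support, hence the BufferedInputStream
      InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
      try {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(in);
        CharsetMatch match = detector.detect(); // best guess from a sample of the bytes
        System.out.println("guessed " + match.getName()
            + " (confidence " + match.getConfidence() + "/100)");
      } finally {
        in.close();
      }
    }
  }

Remember it's only a statistical guess: getConfidence() is 0-100, so treat low-confidence matches with suspicion.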
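And here's a sketch of the "paranoid" decoder approach described above, again just an illustration: CharacterCodingException is the common supertype of the malformed-input and unmappable-character exceptions the REPORT actions produce.

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CharsetDecoder;
  import java.nio.charset.CodingErrorAction;

  public class StrictDecode {
    public static void main(String[] args) throws IOException {
      // REPORT makes the decoder throw instead of silently substituting U+FFFD
      CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT);
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(new FileInputStream(args[0]), decoder));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          // process the line as usual
        }
        System.out.println("file decoded cleanly as UTF-8");
      } catch (CharacterCodingException e) {
        // covers MalformedInputException and UnmappableCharacterException:
        // the file is definitely NOT valid UTF-8
        System.err.println("definitely not UTF-8: " + e);
      } finally {
        reader.close();
      }
    }
  }

Decoding the whole file cleanly still doesn't prove the file is UTF-8, only that it's valid UTF-8; the exception path is the useful one, since it proves the declared encoding is wrong.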