On Wed, Dec 26, 2012 at 2:20 PM, Al Johnson wrote:

> I need to make sure a backend Java process is doing the same UTF
> normalization that is done for edit text. Grep'ing for 'normaliz' brings
> up a lot and I'm not a php dev. Can someone point me to a key php module
> and/or function?

The PHP code for this is in includes/normal -- luckily you shouldn't have to
replicate most of that code, which is nasty and low-level.

For the most part, you want to do two things:

* make sure the input is valid UTF-8
* normalize any combining character sequences to 'normalization form C' (NFC)

Reading data in from a UTF-8 input stream into a Java string should already
take care of making sure it's valid UTF-8. :) If you want to treat invalid
input the same way, make sure that invalid UTF-8 sequences get converted to
the 'replacement character' U+FFFD rather than throwing an exception.

It looks like you should be able to use the java.text.Normalizer class to
convert to NFC:
<http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html>

You might or might not prefer to use the Java version of the ICU library to
do the same thing; it might be more up to date:
<http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html>

-- brion
_______________________________________________
MediaWiki-l mailing list
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
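
For reference, the two steps described in the reply above (lenient UTF-8 decoding, then NFC) could be sketched in Java roughly as follows. This is only a minimal sketch: the class name Utf8Nfc and the method decodeAndNormalize are made up for illustration, and it has not been checked against the PHP code in includes/normal, so the exact number of U+FFFD characters produced for a given invalid sequence may differ from what MediaWiki emits.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of MediaWiki or the JDK.
final class Utf8Nfc {
    private Utf8Nfc() {}

    // Decodes raw bytes as UTF-8, replacing invalid sequences with U+FFFD
    // instead of throwing, then normalizes the result to NFC.
    static String decodeAndNormalize(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            CharBuffer chars = decoder.decode(ByteBuffer.wrap(input));
            return Normalizer.normalize(chars, Normalizer.Form.NFC);
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling Utf8Nfc.decodeAndNormalize on the bytes 0x65 0xCC 0x81 ('e' followed by a combining acute accent) should yield the single precomposed character U+00E9. If you already have a Java String, only the Normalizer.normalize call is needed.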
