On Tue, Jan 17, 2012 at 4:35 AM, Rasmus Svensson <[email protected]> wrote:
> You can use this as a temporary workaround:
>
> (require '[clojure.string :as str])
>
> (defn strip-supplementary [s]
>   (str/replace s #"[^\u0000-\uFFFF]+" "(removed supplementary characters)"))
>
> (strip-supplementary "The first three letters of the Gothic alphabet are: \uD800\uDF30\uD800\uDF31\uD800\uDF32")
> ;=> "The first three letters of the Gothic alphabet are: (removed supplementary characters)"

Rasmus, thanks for that suggestion. I have seen this regular expression used recently for the same purpose, but without an explanation of why it matches only supplementary characters. Do you know of a good explanation, or have you read one somewhere?

I know that JDK 1.6.x's regex matching code is supposed to be UTF-16 aware. Perhaps what is going on is this: even though the range \u0000-\uFFFF includes the surrogate code units, the matching engine never matches the two 16-bit halves of a surrogate pair individually. It matches the pair as a single code point, and that code point lies outside the range \u0000-\uFFFF.

I've found an email conversation from early 2011 between Tom Christiansen of Perl fame and a Java developer at Oracle, in which they discussed many details of Java's Unicode regex matching behavior and what was missing from it. I don't yet have a brief description of what JDK 1.6's regex matching can and cannot do with regard to Unicode, although I am working on a few individual test cases that show what it cannot do.

Thanks,
Andy
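
One way to test the surrogate-pair hypothesis above at the REPL is sketched below. It is only a small experiment, not an authoritative description of the JDK's behavior: `gothic-ahsa`, `lone-high-surrogate`, and `match-count` are names made up for this illustration, and the comments state what one would expect if the engine really does consume a surrogate pair as a single code point.

```clojure
;; U+10330 GOTHIC LETTER AHSA, written as a UTF-16 surrogate pair.
(def ^String gothic-ahsa "\uD800\uDF30")

;; An unpaired high surrogate (hypothetical test input).
(def ^String lone-high-surrogate "\uD800")

;; Hypothetical helper: how many separate matches does re find in s?
(defn match-count [re s]
  (count (re-seq re s)))

;; The pair occupies two 16-bit code units but is one code point.
(.length gothic-ahsa)
(.codePointCount gothic-ahsa 0 (.length gothic-ahsa))

;; Expected to be 1, not 2, if "." consumes the whole pair at once.
(match-count #"." gothic-ahsa)

;; Expected to be 1 if the pair is compared as code point U+10330,
;; which lies outside \u0000-\uFFFF.
(match-count #"[^\u0000-\uFFFF]" gothic-ahsa)

;; Expected to be 0: an unpaired surrogate is itself a value inside
;; \u0000-\uFFFF, so the negated class should not match it.
(match-count #"[^\u0000-\uFFFF]" lone-high-surrogate)
```

If the counts come out as expected, the regular expression removes supplementary characters because they are compared as whole code points above \uFFFF, while unpaired surrogates are left in place.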
