On Tue, Jan 17, 2012 at 4:35 AM, Rasmus Svensson <[email protected]> wrote:

> You can use this as a temporary workaround:
>
>    (require '[clojure.string :as str])
>
>    (defn strip-supplementary [s]
>      (str/replace s #"[^\u0000-\uFFFF]+"
>                   "(removed supplementary characters)"))
>
>    (strip-supplementary
>      (str "The first three letters of the Gothic alphabet are: "
>           "\uD800\uDF30\uD800\uDF31\uD800\uDF32"))
>    ;=> "The first three letters of the Gothic alphabet are: (removed supplementary characters)"
>

Rasmus, thanks for that suggestion.  I have seen this regular expression
used recently for the same purpose, but never with an explanation of why it
matches only supplementary characters.  Do you know, or have you read
somewhere, a good explanation for that?

I know that JDK 1.6.x's regex matching code is supposed to be UTF-16 aware,
so perhaps what is going on here is this: even though the range
\u0000-\uFFFF includes the surrogate characters, the regex matching engine
never matches the two 16-bit halves of a surrogate pair by themselves, only
the pair as a whole, and the code point that the pair encodes is outside of
the range \u0000-\uFFFF?
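
One way to probe that hypothesis at the REPL is sketched below. It is only a
sketch: the name gothic-ahsa is made up for illustration, and it relies on
java.util.regex character classes and "." consuming one whole code point per
step, which is how the Pattern documentation describes them. re-matches is
used instead of re-find so that the match is anchored at index 0 and never
starts in the middle of a pair.

```clojure
;; Hypothetical name, for illustration only: U+10330 (GOTHIC LETTER AHSA),
;; written as the same surrogate pair used in the quoted example.
(def ^String gothic-ahsa "\uD800\uDF30")

(count gothic-ahsa)                                 ;=> 2  (UTF-16 code units)
(.codePointCount gothic-ahsa 0 (count gothic-ahsa)) ;=> 1  (code points)
(.codePointAt gothic-ahsa 0)                        ;=> 66352, i.e. 0x10330

;; The negated class matches the pair as one unit ...
(= gothic-ahsa (re-matches #"[^\u0000-\uFFFF]" gothic-ahsa)) ;=> true

;; ... the positive class does not match it at all ...
(re-matches #"[\u0000-\uFFFF]" gothic-ahsa)                  ;=> nil

;; ... and "." also consumes both code units in a single step.
(= gothic-ahsa (re-matches #"." gothic-ahsa))                ;=> true

;; A lone (unpaired) surrogate is treated as its own code point, so it
;; does fall inside \u0000-\uFFFF.
(= "\uD800" (re-matches #"[\u0000-\uFFFF]" "\uD800"))        ;=> true
```

If those results hold on a given JDK, they fit the reading above: the engine
compares code points, not individual 16-bit code units, against the range.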

I've found an email conversation from early 2011 between Tom Christiansen
of Perl fame and a Java developer at Oracle, in which they discussed many
details of Java's Unicode regex matching behavior and what was missing from
it.  I don't yet have a brief description of what JDK 1.6's regex matching
can and cannot do with regard to Unicode, although I am working on a few
individual test cases that show what it cannot do.

Thanks,
Andy
