I don't know enough to tell you "just do this, and your Emacs issues will
be solved," but I can give some hints as to what these characters are, so
that others can say more, or so that you can direct your Google searches in
a more focused manner.

I believe those are Unicode characters of the kind called "supplementary"
characters.  Because Java and Clojure store strings in an encoding called
UTF-16 [1], each of them requires 2 consecutive 16-bit Java chars to
represent.  In most software, these don't get tested as often as characters
in the BMP (Basic Multilingual Plane -- this covers the most commonly used
characters, each of which requires only a single Java char to store).
When it comes to copying and pasting them between
applications, or sending them across debug sockets, every piece of software
along the way gets its own chance to muck things up.  Likely a day will
come when software that doesn't handle these things properly will be rare,
but I don't think we are there yet.
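
To make the "2 consecutive 16-bit Java chars" point concrete, here is a small sketch you could try at a plain terminal REPL.  The name smile is just something I made up for illustration, and I build the string from its code point so that the example does not depend on your editor or terminal handling the character correctly.

```clojure
;; U+1F60A is a supplementary character, so UTF-16 needs a surrogate
;; pair (2 Java chars) to store it.  `smile` is a made-up name.
(def ^String smile (String. (Character/toChars 0x1F60A)))

(count smile)                            ; => 2, counts 16-bit Java chars
(.codePointCount smile 0 (count smile))  ; => 1, counts Unicode code points
(map #(format "%04X" (int %)) smile)     ; => ("D83D" "DE0A"), the surrogates
```

Any piece of software that assumes one Java char is one character will see two odd-looking code units here instead of one smiley.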

The particular characters you gave as an example appear to be Unicode
characters with these code points:

U+1F60A "SMILING FACE WITH SMILING EYES"
U+1F60F "SMIRKING FACE"

I am not sure if these are considered Emoji [2] characters or not, but I
have heard that such characters are getting popular on Twitter, in phone
text messages, and in a few other places.  I found the code points by
saving the web page as an HTML file, opening that file in Emacs, moving the
cursor over each of those characters, and pressing C-x =.  Doing so shows
this at the bottom of the window for the first character:

Char: <empty rectangle> (128522, #o373012, #x1f60a, file ...) point=26267
of 54273 (48%) columns=91

The empty rectangle is because the font I was using didn't include a glyph
for this character.  The 3 numbers in parentheses are the decimal, octal,
and hex values of the Unicode code point -- I copied the hex value above and
looked up its name in this file:

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

There are other web sites that let you search for these things, too, and
see them graphically in the browser window even if you don't have the
proper fonts installed.
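
If you would rather do the lookup from Clojure than from Emacs, something like the following might work.  This is only a sketch: code-point-info is a name I invented, and it assumes a JVM that has Character/getName (added in Java 7) and whose Unicode tables include these characters.

```clojure
(defn code-point-info
  "Hypothetical helper: returns a vector with one string per Unicode code
   point in s, giving the code point in hex and its Unicode name.
   Assumes Java 7 or later, for Character/getName."
  [^String s]
  (loop [i 0, acc []]
    (if (< i (.length s))
      (let [cp (.codePointAt s i)]
        ;; Advance by 1 char for BMP code points, 2 for supplementary ones.
        (recur (+ i (Character/charCount cp))
               (conj acc (format "U+%04X %s" cp (Character/getName cp)))))
      acc)))

;; Example use, building the two characters from their code points:
(code-point-info (str (String. (Character/toChars 0x1F60A))
                      (String. (Character/toChars 0x1F60F))))
```

On a JVM like that, I would expect the result to show the same two names I listed above.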

I don't know if it is only supplementary characters that will cause you
problems, but if so, you can detect strings containing them with a function
like this:

(defn contains-supp?
  "Returns true if the string or CharSequence s contains supplementary
   characters, outside the Basic Multilingual Plane.  Returns false if
   the string only contains characters in the BMP.

   For Java/Clojure strings, which are encoded in UTF-16, a string
   contains supplementary characters only if the string contains at
   least one surrogate code unit in the range U+D800 through U+DFFF."
  [^CharSequence s]
  (if (first (filter #(<= (int Character/MIN_SURROGATE)
                          % (int Character/MAX_SURROGATE))
                     (map int s)))
    true false))

You can leave out the "(if" and "true false)" if you don't mind getting
back nil for false and a non-nil value for true.  Clojure's if, when, etc.
all treat nil and false as false, and any other value as true.
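
If skipping such strings entirely is too drastic, another option might be to replace each supplementary character with a plain ASCII placeholder before printing.  The sketch below is untested against your Emacs setup, and escape-supp and the <U+...> placeholder format are my own inventions; I am also only assuming that supplementary characters are what break the connection.

```clojure
(defn escape-supp
  "Hypothetical helper: returns a copy of s in which every supplementary
   code point is replaced by ASCII text like <U+1F60A>.  Characters in
   the BMP are copied unchanged."
  [^String s]
  (let [sb (StringBuilder.)
        n  (.length s)]
    (loop [i 0]
      (if (< i n)
        (let [cp (.codePointAt s i)]
          (if (Character/isSupplementaryCodePoint cp)
            (.append sb (format "<U+%X>" cp))
            (.appendCodePoint sb cp))
          ;; Step over both halves of a surrogate pair at once.
          (recur (+ i (Character/charCount cp))))
        (str sb)))))

;; Example: (println (escape-supp some-tweet-text))
```

The result contains only BMP characters, so printing it should at least tell you whether the supplementary characters are the culprit.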

Andy

[1] http://en.wikipedia.org/wiki/UTF-16

[2] http://en.wikipedia.org/wiki/Emoji


On Mon, Jan 16, 2012 at 7:58 AM, joachim <[email protected]> wrote:

> Dear All,
>
> I'm not sure if this is the right place to ask, but I am experiencing
> a strange and rather annoying problem, probably in the interaction
> between clojure and emacs.
>
> Basically, I have to deal with strings. Sometimes the strings contain
> non-standard characters (I do not know the nature of these characters
> myself). Here is an example string, with the non-standard characters
> printed as squares:
>
>   "Michelle Obama is a Capricorn !!! Jan 17th. 😊😏 ---&gt;
> http://t.co/1moZ4IUZ";
>
> When I run clojure from a terminal and input the above string there is
> no problem:
>
>   joachim@joachim-HP-EliteBook-8440p:~/opt/clojure-1.3.0$ java -jar
> clojure-1.3.0.jar
>   Clojure 1.3.0
>   user=> "Michelle Obama is a Capricorn !!! Jan 17th. 😊😏 ---&gt;
> http://t.co/1moZ4IUZ";
>   "Michelle Obama is a Capricorn !!! Jan 17th. 😊😏 ---&gt;
> http://t.co/1moZ4IUZ";
>   user=>
>
> However, when I try the same in an emacs repl, I get  "Lisp connection
> closed unexpectedly: connection broken by remote peer". I have no idea
> what is going on or how to deal with this problem. Sometimes during
> development I like to print the strings to see what is going on, but
> this also causes the connection to close.
>
> I would also be happy if I could recognize "problematic" strings, so
> that I can skip them when printing, thus avoiding the problem
> (although this would not really be a solution),
>
> Any ideas?
>
> Joachim.
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
