On Sun, Jun 27, 2010 at 8:25 PM, Eli Zaretskii <[email protected]> wrote:

> > Date: Sun, 27 Jun 2010 06:30:27 +0300
> > From: Amit Aronovitch <[email protected]>
> > Cc: Eli Zaretskii <[email protected]>, [email protected]
> >
> > First, thanks Eli and all contributors for the remarkable effort, and all
> > the recent progress!
>
> You're most welcome.
>
> > Note that there are two separate issues:
> > (1) Directionality (I'll use here B to represent hebrew Bet):
> >    Should the message be displayed "is undefined B^" (RTL paragraph dir)
> >    or "^B is undefined" (LTR paragraph dir)
> >
> > (2)  Alignment (to right or left margin) - where that message is to be
> > displayed. It makes sense to align to the "start" direction (i.e. right
> > for RTL and left for LTR), but AFAIK this is a matter of style and not
> > within the scope of the unicode standard.
> >
> >     (2) is a relatively minor problem, while (1) could be a real source
> > for confusion to the reader.
>
> In Emacs, (2) is entirely determined by (1): a L2R paragraph is
> displayed flushed all the way to the left margin of the window, while
> R2L paragraphs are flushed to the right margin.
>
>
This is perfectly acceptable. I just wanted to point out the problem more
clearly, as the OP named the *alignment* as being wrong (which is correlated
with, but not exactly the same as, the actual problem).

> I don't see any reason to have the paragraph and alignment be
> independent.  Every bidi-aware word processor I've seen behaves like I
> described above, and I'm quite sure users expect that.
>
>
Of course. This is the most reasonable default.
However, word processors typically also have an option for selectively
modifying the alignment without affecting the directionality (toolbars have
separate buttons for directionality and alignment), and this gets them out
of sync.
(Such explicit alignment information might not be saved in plain-text files,
but might be useful for "rich" formats - maybe in w3 mode etc.)
One example where this might be useful is when you have a list of items
(names, addresses, cited references), some of which are RTL and some LTR, and
you wish the whole list to align to a single margin, to avoid a ragged
appearance. Another example is within tables.

> > True. There is no way to the determine 100% surely the correct direction
> > of a sentence out of context. That is why the unicode standard leaves the
> > freedom for "higher level protocol" to set that (
> > http://unicode.org/reports/tr9/ HL1) .
> > When such information is not available, a simple default algorithm is
> > described by the standard (rules P2, P3). This is implemented by common
> > bidi reordering libs, and I guess this is the reason for what you see here.
>
> Emacs doesn't use any reordering libraries, but it does implement
> UAX#9 to the letter, including determining the paragraph direction
> from its first strong directional character.
>
>
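For readers who have not looked at UAX#9: the "first strong character"
default (P2, P3), together with a higher-level override in the spirit of
HL1, can be sketched roughly as below. This is only an illustration in
Python, not what Emacs or any reordering library actually runs; the
function name and the "override" argument are made up for the example,
and everything in UAX#9 beyond P2/P3 (embeddings, reordering itself) is
ignored.

```python
import unicodedata

# Bidi classes that count as "strong" when picking the paragraph direction.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def paragraph_direction(text, override=None):
    """Return "LTR" or "RTL" for one paragraph of logical-order text.

    `override` is a hypothetical stand-in for a higher-level protocol
    (HL1); when it is given, the text is not inspected at all.
    """
    if override in ("LTR", "RTL"):
        return override
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "LTR"
        if bidi_class in STRONG_RTL:
            return "RTL"
    # P3: no strong character at all, so default to left-to-right.
    return "LTR"


# "^" is a neutral, so the Hebrew Bet after it decides the direction.
print(paragraph_direction("^\u05d1 is undefined"))                  # RTL
# Here the Latin "C" is the first strong character.
print(paragraph_direction("C-\u05d0 is undefined"))                 # LTR
# An explicit override wins over auto-detection.
print(paragraph_direction("^\u05d1 is undefined", override="LTR"))  # LTR
```
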
It would be nice if we could specify the direction explicitly
(manually) for selected paragraphs in the buffer. This can be stored in the
same way that other metadata (font sizes? color? images?) is being handled.
(p.s. If the buffer is plaintext, this information would probably be lost
when we save it. Still it might serve as a "manual override" to help
readability as long as the buffer is open).

> > > Aren't problems like this the entire raison d'etre of the invisible RLM
> > > and LRM characters?
> > >
> > >
> > One of the main reasons. True. But, depending on the bidi reordering
> > function used, the application might be able to achieve the results by
> > providing this "higher level choice" itself. With libfribidi, the
> > "pbase_dir" input parameter can be used for that.
>
> In Emacs, we have the bidi-paragraph-direction variable, which
> overrides the direction determined by the first strong character.
>

Is that per-buffer? What if you want to control directionality of specific
paragraphs? (you should be able to do that to properly show bidi text e.g.
in w3 mode).


>
> > IMO, since the echo messages are typically one-liners, their
> > directionality should be defined by their language.
>
> But what is the language of a message that includes mixed Hebrew and
> English words or letters?
>
>
In all cases I can think of, the language of the message (the messages to be
displayed in the echo area) should be as specified by the locale
(LC_MESSAGES). This is because if the locale is English, the message itself
(the informative wrapper, the template) is actually meant to be in English,
and any Hebrew parts come from quoted characters etc. (template data,
runtime variables). Vice versa for the case where LC_MESSAGES=he .

(Explanation for readers who are not familiar with the terms: Typically, for
 i18n support in Unix apps, you write default messages in English, and print
them using e.g. GNU gettext (3). If a translation file (provided by relevant
translation team) exists for the language specified by the user's locale,
this causes the message to be printed in that language. The translated
message itself may be merely a template, which includes placeholders for
inserting runtime data).
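
To make the idea concrete, here is a rough Python sketch of "the locale
decides the direction, the runtime data does not". It is not how Emacs
or gettext behaves: the function names and the short list of RTL
languages are invented for the example, the locale value is passed in as
a plain string instead of being read from the environment, and prefixing
an LRM/RLM is just one possible way to pin the direction for a renderer
that uses first-strong detection.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, strong L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK: invisible, strong R

# Hypothetical and deliberately incomplete list of RTL message languages.
RTL_LANGUAGES = {"he", "ar", "fa", "ur", "yi"}


def locale_direction(lc_messages):
    """Map an LC_MESSAGES-style value such as "he_IL.UTF-8" to a direction."""
    # Strip the codeset (".UTF-8"), modifier ("@euro") and territory ("_IL").
    lang = lc_messages.split(".")[0].split("@")[0].split("_")[0].lower()
    return "RTL" if lang in RTL_LANGUAGES else "LTR"


def format_message(template, data, lc_messages):
    """Fill the template, then pin its direction to that of the locale."""
    mark = RLM if locale_direction(lc_messages) == "RTL" else LRM
    return mark + template.format(data)


# English wrapper with a quoted Hebrew key: the message stays LTR even
# though the first visible strong character is Hebrew.
msg = format_message("{} is undefined", "^\u05d1", "en_US.UTF-8")
print(locale_direction("en_US.UTF-8"), repr(msg))
```

With LC_MESSAGES=he the same call would prepend an RLM instead, so a
translated Hebrew template stays RTL even when the inserted data starts
with Latin letters.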

> Emacs allows you to mix several scripts (a.k.a. "languages") in the
> same buffer, so it is no longer clear in what "language" the document
> is written.
>
> > Don't know about "should" (because as you said, both of them look
> > "wrong").
> > However if you let the standard unicode algorithm reorder the logical
> > string "^B is undefined" with the default auto-detected directionality, it
> > really does result with what you seem to expect (the circumflex (0x5e) is a
> > neutral, and gets the directionality of the run). Maybe this is not really
> > a circumflex, or maybe some other magic is at work here.
>
> If "^" were a normal character, I'd agree (and Emacs would then render
> them automatically per UAX#9 anyway).  But this is not the case.
> Here, the string ^B or B^ is a display feature; the display engine
> produces these two characters as a single display element, and cursor
> motion treats them both as a single atomic entity.  The question is:
> within that atomic entity, how should we display the "^" part?
>
>
OK, that kind of "other magic" then :-)


> Don't get me wrong: if the consensus is that we should display this as
> if we had 2 distinct characters, using UAX#9 reordering rules, I'm
> okay with that.
>
>
To me at least, it does seem better to show it as in UAX#9.
However, it seems that I cannot reproduce the scenario at the moment (see
below).

> > From a brief check, on Linux with X,  with Hebrew and English
> > layouts, situation seems to be like that:
> >
> > 1) On the basic X level (I used xev to test) there is a "state" (binary
> > flags, indicate e.g. if ctrl was held, and also the "group" i.e. if we are
> > in Hebrew or English mode), keycode (a number, which is the same for "א"
> > and "t"), and an "XLookupString" which is the same (14) for both "ctrl-t"
> > and "ctrl-א" (but does differentiate between them if ctrl is not held). xev
> > also reports "keysym" which is the unicode point for "t" in both cases
> > (ctrl-t and ctrl-א), but is the unicode point for א if control is not
> > pressed.
> > 2) In gtk (a higher level interface), there is "gdk_keyval_name", which is
> > either "א" or "t" according to the current layout (language mode). Whether
> > or not ctrl was down is determined by the mask GDK_CONTROL_MASK in the
> > state of the event.
> >
> > Note that at both levels there is no specific code for "ctrl-א". Whatever
> > it is that emacs sees is either generated by some higher level function
> > that I am not aware of, or generated within emacs itself. Probably we
> > should look it up in the code.
>
> Instead of looking in the code, it is much easier to put the cursor on
> the ctrl-א thing, and type "C-u C-x =".  Then Emacs will tell you what
> it thinks about this character, including its codepoint.
>
> Could you please do this?  I need to know that in order to understand
> why Emacs treats this "character" as strong R.  I cannot produce this
> strange character on MS-Windows, or else I'd do this myself.
>

Not sure how to do that. It only appears in the echo area and I cannot
insert it in a buffer (the message disappears if I try to click the
minibuffer or move the cursor there using keyboard shortcuts). By the way,
the message that I see is "C-א not defined", not ^א as Larry described.
I tried binding the key to self-insert-command, and then I get a regular א
inserted into the buffer.
Actually, while typing the above, I realized that while I was trying to bind
the key, I had C-א appearing in the minibuffer. Checking, I saw that in
that scenario I can actually move the cursor around to it, and use C-u C-x
=. However, this reveals that the C-א displayed there is actually three
characters (C, -, א)...

   AA