Re: regex-quote.c syntax support

Reuben Thomas Sat, 05 Mar 2011 08:31:14 -0800

On 5 March 2011 14:51, Bruno Haible <br...@clisp.org> wrote:
> Hello Reuben,
>
>> regex-quote seems only to support two syntaxes at the moment
>
> Yes. POSIX specifies two syntaxes.


regex.h suggests that in practice there are a couple more:

RE_SYNTAX_POSIX_EGREP
RE_SYNTAX_POSIX_AWK

each of which is different from the other and from POSIX basic and extended.

> Rather it's an 'int' with the same meaning as the cflags argument that you
> pass to regcomp().

Any non-zero value counts as selecting extended syntax in
regex_quote*, whereas in regcomp only one bit does that. (I point this
out only as a potential source of ABI breakage.)
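
To make the difference concrete, here is a minimal sketch using only the POSIX <regex.h> interface. The three helper names are hypothetical and exist only for illustration; the "non-zero" test mirrors the behaviour described above rather than quoting regex-quote.c.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* How regcomp() decides: only the REG_EXTENDED bit selects extended syntax. */
static bool selects_extended_like_regcomp(int cflags)
{
    return (cflags & REG_EXTENDED) != 0;
}

/* The behaviour described above for regex_quote*: any non-zero value. */
static bool selects_extended_if_nonzero(int cflags)
{
    return cflags != 0;
}

/* Returns true if PATTERN, compiled with CFLAGS, matches STRING.
   A pattern that fails to compile is reported as "no match". */
static bool matches(const char *pattern, int cflags, const char *string)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && string != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return false;
    rc = regexec(&re, string, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With cflags = REG_ICASE the two predicates disagree: regcomp() still compiles a basic regex, in which "(" and ")" are literal, while a "non-zero means extended" test would quote for extended syntax.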

> True, but on the other hand if the caller is supposed to determine the
> characters to be escaped ad-hoc, the risk of mistake is pretty high.

> On the other hand, 'grep' supports basic, extended, and PCRE syntaxes,
> but not the Emacs syntax.

Presumably it supports not RE_SYNTAX_POSIX_EXTENDED but rather
RE_SYNTAX_POSIX_EGREP? Or both?

> Before we can decide on this, IMO some analysis is needed:
>
>  - What are the possible effects of reg_syntax_t on the string of
>    characters to be escaped? I can see
>      RE_BK_PLUS_QM                   ->    +?
>      RE_INTERVALS, RE_NO_BK_BRACES   ->    {}
>    What other relations are there?

RE_NO_BK_PARENS -> ()
RE_NO_BK_VBAR -> |
RE_NO_BK_REFS -> digits (\1 to \9 are back-references unless this bit is set)
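
As a rough sketch of how these relations could be turned into code, the function below derives the set of characters that are special when unescaped from a reg_syntax_t value. It assumes the GNU <regex.h> from glibc or gnulib, which exposes the RE_* bits only when _GNU_SOURCE is defined. The name special_chars is hypothetical, the RE_LIMITED_OPS check is my own addition, and RE_NO_BK_REFS and RE_NEWLINE_ALT are deliberately left out.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1   /* assumption: glibc hides the RE_* bits otherwise */
#endif
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of regex-quote.c.
   Stores in BUF the characters that act as operators when they appear
   unescaped under SYNTAX.  BUF must have room for at least 16 bytes. */
static const char *special_chars(reg_syntax_t syntax, char *buf, size_t size)
{
    assert(buf != NULL && size >= 16);

    /* Special in every syntax considered in this thread. */
    strcpy(buf, "$^.*[]\\");

    /* + and ? are operators unless they need a backslash or are disabled. */
    if (!(syntax & RE_LIMITED_OPS) && !(syntax & RE_BK_PLUS_QM))
        strcat(buf, "+?");

    /* Braces are operators only if intervals exist and need no backslash. */
    if ((syntax & RE_INTERVALS) && (syntax & RE_NO_BK_BRACES))
        strcat(buf, "{}");

    if (syntax & RE_NO_BK_PARENS)
        strcat(buf, "()");

    if (!(syntax & RE_LIMITED_OPS) && (syntax & RE_NO_BK_VBAR))
        strcat(buf, "|");

    return buf;
}
```

Note that a quoting function must not put a backslash in front of a character that is absent from this set: under RE_BK_PLUS_QM, for example, "\+" is the operator and a bare "+" is the literal.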

>  - What characters need to be escaped in Emacs syntax?

Emacs syntax is simply the syntax with all the bits switched off, so:

$^.*[]\+?

>  - What characters need to be escaped in PCRE syntax?

According to pcrepattern(3):

^$.[|()?*+{

(Which makes me wonder why we treat ] as special in regex-quote.c.)
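
For illustration, a quoting routine for PCRE syntax could look roughly like the sketch below. The name pcre_quote is hypothetical and this is not the regex-quote.c API. The character set is the list above plus the backslash itself, which pcrepattern(3) also lists as a metacharacter. The sketch works byte by byte, which should be safe for UTF-8 because every metacharacter is ASCII.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Metacharacters outside a character class: the list above, plus the
   backslash (added here; it is the escape character itself). */
static const char pcre_special[] = "\\^$.[|()?*+{";

/* Hypothetical helper.  Returns a malloc'd copy of S in which every
   metacharacter is preceded by a backslash, or NULL if allocation fails.
   The caller frees the result. */
static char *pcre_quote(const char *s)
{
    char *out;
    char *p;

    assert(s != NULL);

    /* Worst case: every byte gets a backslash in front of it. */
    out = malloc(2 * strlen(s) + 1);
    if (out == NULL)
        return NULL;

    p = out;
    for (; *s != '\0'; s++) {
        if (strchr(pcre_special, *s) != NULL)
            *p++ = '\\';
        *p++ = *s;
    }
    *p = '\0';
    return out;
}
```

Here "]" and "}" pass through unescaped, since neither has a special meaning outside a character class or quantifier once the opening "[" or "{" has been quoted.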

>  - Do Emacs and PCRE view a regex as a sequence of bytes or as a sequence
>    of multibyte characters in the locale encoding (given by LC_CTYPE)?

PCRE doesn't do locales; it treats strings as either bytes or, given a
specific flag, UTF-8.

I don't really understand the question about Emacs: someone using
regex-quote in their own programs is worried about Emacs syntax, not
Emacs encodings, because Emacs doesn't have a C API. My understanding
of Emacs is that it has its own universal internal encoding, which
differs from the encoding of a particular buffer being edited; the
latter can be bytes, 7-bit or 8-bit characters, or multibyte
characters, according to the file being edited and the user's selected
encoding.

HTH!

-- 
http://rrt.sc3d.org
