On 5 March 2011 14:51, Bruno Haible <br...@clisp.org> wrote:
> Hello Reuben,
>
>> regex-quote seems only to support two syntaxes at the moment
>
> Yes. POSIX specifies two syntaxes.

regex.h suggests that in practice there are a couple more:

RE_SYNTAX_POSIX_EGREP
RE_SYNTAX_POSIX_AWK

each of which is different from the other and from POSIX basic and extended.

> Rather it's an 'int' with the same meaning as the cflags argument that you
> pass to regcomp().

Any non-zero value counts as selecting extended syntax in
regex_quote*, whereas in regcomp only one bit does that. (I point this
out only as a potential source of ABI breakage.)

> True, but on the other hand if the caller is supposed to determine the
> characters to be escaped ad-hoc, the risk of mistake is pretty high.
> On the other hand, 'grep' supports basic, extended, and PCRE syntaxes,
> but not the Emacs syntax.

Presumably it supports not RE_SYNTAX_POSIX_EXTENDED but rather
RE_SYNTAX_POSIX_EGREP? Or both?

> Before we can decide on this, IMO some analysis is needed:
>
>  - What are the possible effects of reg_syntax_t on the string of
>    characters to be escaped? I can see
>      RE_BK_PLUS_QM -> +?
>      RE_INTERVALS, RE_NO_BK_BRACES -> {}
>    What other relations are there?

RE_NO_BK_PARENS -> ()
RE_NO_BK_VBAR -> |
RE_NO_BK_REFS -> [:digit:]

>  - What characters need to be escaped in Emacs syntax?

Emacs syntax is simply the syntax with all the bits switched off, so: $^.*[]\+?

>  - What characters need to be escaped in PCRE syntax?

According to pcrepattern(3): ^$.[|()?*+{

(Which makes me wonder why we treat ] as special in regex-quote.c.)

>  - Do Emacs and PCRE view a regex as a sequence of bytes or as a sequence
>    of multibyte characters in the locale encoding (given by LC_CTYPE)?

PCRE doesn't do locales; it treats strings as either bytes or, given a
specific flag, UTF-8.

I don't really understand the question about Emacs: someone using
regex-quote in their own programs is worried about Emacs syntax, not
Emacs encodings, because Emacs doesn't have a C API.
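
For what it's worth, here is a rough sketch of how the bit-to-character
relations listed above might be turned into a quoting routine. It is
not the code in regex-quote.c. The SK_* constants are hypothetical
stand-ins for the reg_syntax_t bits (not the real regex.h values), the
function names are invented for illustration, and the string is
treated as plain bytes, so the multibyte question is left open.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the reg_syntax_t bits discussed above.
   These are NOT the values defined in regex.h.  */
enum
{
  SK_BK_PLUS_QM   = 1 << 0,  /* \+ and \? are the operators, not + and ? */
  SK_INTERVALS    = 1 << 1,  /* intervals are recognized */
  SK_NO_BK_BRACES = 1 << 2,  /* { } are the interval operators */
  SK_NO_BK_PARENS = 1 << 3,  /* ( ) are the grouping operators */
  SK_NO_BK_VBAR   = 1 << 4   /* | is the alternation operator */
};

/* Store in BUF (at least 16 bytes) the characters that need a
   backslash under SYNTAX.  The base set is the one assumed for the
   syntax with all bits off, minus + and ?, which depend on a bit.  */
void
special_chars (unsigned int syntax, char *buf)
{
  strcpy (buf, "$^.*[]\\");
  if (!(syntax & SK_BK_PLUS_QM))
    strcat (buf, "+?");
  if ((syntax & SK_INTERVALS) && (syntax & SK_NO_BK_BRACES))
    strcat (buf, "{}");
  if (syntax & SK_NO_BK_PARENS)
    strcat (buf, "()");
  if (syntax & SK_NO_BK_VBAR)
    strcat (buf, "|");
}

/* Copy S to OUT, putting a backslash before each special byte.
   OUT must have room for 2 * strlen (S) + 1 bytes.  */
void
quote_for_syntax (unsigned int syntax, const char *s, char *out)
{
  char special[16];

  assert (s != NULL && out != NULL);
  special_chars (syntax, special);
  for (; *s != '\0'; s++)
    {
      if (strchr (special, *s) != NULL)
        *out++ = '\\';
      *out++ = *s;
    }
  *out = '\0';
}
```

With all bits off (the Emacs-like case), quote_for_syntax (0, "a+b", out)
escapes the +, while "(x|y)" passes through unchanged, since there the
backslashed forms are the operators.
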
My understanding of Emacs is that it has its own universal internal
encoding, which differs from the encoding of a particular buffer being
edited; the latter can be bytes, 7-bit or 8-bit characters, or
multibyte characters, according to the file being edited and the
user's selected encoding.

HTH!

-- 
http://rrt.sc3d.org