Modified: tomcat/jk/trunk/native/iis/pcre/doc/pcre.txt
URL: http://svn.apache.org/viewvc/tomcat/jk/trunk/native/iis/pcre/doc/pcre.txt?rev=1815927&r1=1815926&r2=1815927&view=diff
==============================================================================
--- tomcat/jk/trunk/native/iis/pcre/doc/pcre.txt (original)
+++ tomcat/jk/trunk/native/iis/pcre/doc/pcre.txt Tue Nov 21 14:37:37 2017
@@ -4640,7 +4640,7 @@ DIFFERENCES BETWEEN PCRE AND PERL
  pattern names is not as general as Perl's. This is a consequence of the
  fact the PCRE works internally just with numbers, using an external ta-
  ble to translate between numbers and names. In particular, a pattern
- such as (?|(?<a>A)|(?<b)B), where the two capturing parentheses have
+ such as (?|(?<a>A)|(?<b>B), where the two capturing parentheses have
  the same number but different names, is not supported, and causes an
  error at compile time. If it were allowed, it would not be possible to
  distinguish which parentheses matched, because both names map to cap-
@@ -5028,55 +5028,56 @@ BACKSLASH
  ate the appropriate EBCDIC code values. The \c escape is processed as
  specified for Perl in the perlebcdic document. The only characters that
  are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
- Any other character provokes a compile-time error. The sequence \@
- encodes character code 0; the letters (in either case) encode charac-
- ters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
- (hex 1B to hex 1F), and \? becomes either 255 (hex FF) or 95 (hex 5F).
-
- Thus, apart from \?, these escapes generate the same character code
- values as they do in an ASCII environment, though the meanings of the
- values mostly differ. For example, \G always generates code value 7,
+ Any other character provokes a compile-time error.
The sequence \c@ + encodes character code 0; after \c the letters (in either case) encode + characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters + 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 + (hex 5F). + + Thus, apart from \c?, these escapes generate the same character code + values as they do in an ASCII environment, though the meanings of the + values mostly differ. For example, \cG always generates code value 7, which is BEL in ASCII but DEL in EBCDIC. - The sequence \? generates DEL (127, hex 7F) in an ASCII environment, - but because 127 is not a control character in EBCDIC, Perl makes it - generate the APC character. Unfortunately, there are several variants - of EBCDIC. In most of them the APC character has the value 255 (hex - FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If - certain other characters have POSIX-BC values, PCRE makes \? generate + The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, + but because 127 is not a control character in EBCDIC, Perl makes it + generate the APC character. Unfortunately, there are several variants + of EBCDIC. In most of them the APC character has the value 255 (hex + FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If + certain other characters have POSIX-BC values, PCRE makes \c? generate 95; otherwise it generates 255. - After \0 up to two further octal digits are read. If there are fewer - than two digits, just those that are present are used. Thus the + After \0 up to two further octal digits are read. If there are fewer + than two digits, just those that are present are used. Thus the sequence \0\x\015 specifies two binary zeros followed by a CR character (code value 13). Make sure you supply two digits after the initial zero if the pattern character that follows is itself an octal digit. - The escape \o must be followed by a sequence of octal digits, enclosed - in braces. An error occurs if this is not the case. 
This escape is a - recent addition to Perl; it provides way of specifying character code - points as octal numbers greater than 0777, and it also allows octal + The escape \o must be followed by a sequence of octal digits, enclosed + in braces. An error occurs if this is not the case. This escape is a + recent addition to Perl; it provides way of specifying character code + points as octal numbers greater than 0777, and it also allows octal numbers and back references to be unambiguously specified. For greater clarity and unambiguity, it is best to avoid following \ by a digit greater than zero. Instead, use \o{} or \x{} to specify charac- - ter numbers, and \g{} to specify back references. The following para- + ter numbers, and \g{} to specify back references. The following para- graphs describe the old, ambiguous syntax. The handling of a backslash followed by a digit other than 0 is compli- - cated, and Perl has changed in recent releases, causing PCRE also to + cated, and Perl has changed in recent releases, causing PCRE also to change. Outside a character class, PCRE reads the digit and any follow- - ing digits as a decimal number. If the number is less than 8, or if - there have been at least that many previous capturing left parentheses - in the expression, the entire sequence is taken as a back reference. A - description of how this works is given later, following the discussion + ing digits as a decimal number. If the number is less than 8, or if + there have been at least that many previous capturing left parentheses + in the expression, the entire sequence is taken as a back reference. A + description of how this works is given later, following the discussion of parenthesized subpatterns. 
- Inside a character class, or if the decimal number following \ is + Inside a character class, or if the decimal number following \ is greater than 7 and there have not been that many capturing subpatterns, - PCRE handles \8 and \9 as the literal characters "8" and "9", and oth- + PCRE handles \8 and \9 as the literal characters "8" and "9", and oth- erwise re-reads up to three octal digits following the backslash, using - them to generate a data character. Any subsequent digits stand for + them to generate a data character. Any subsequent digits stand for themselves. For example: \040 is another way of writing an ASCII space @@ -5094,31 +5095,31 @@ BACKSLASH \81 is either a back reference, or the two characters "8" and "1" - Note that octal values of 100 or greater that are specified using this - syntax must not be introduced by a leading zero, because no more than + Note that octal values of 100 or greater that are specified using this + syntax must not be introduced by a leading zero, because no more than three octal digits are ever read. - By default, after \x that is not followed by {, from zero to two hexa- - decimal digits are read (letters can be in upper or lower case). Any + By default, after \x that is not followed by {, from zero to two hexa- + decimal digits are read (letters can be in upper or lower case). Any number of hexadecimal digits may appear between \x{ and }. If a charac- - ter other than a hexadecimal digit appears between \x{ and }, or if + ter other than a hexadecimal digit appears between \x{ and }, or if there is no terminating }, an error occurs. - If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x - is as just described only when it is followed by two hexadecimal dig- - its. Otherwise, it matches a literal "x" character. In JavaScript + If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x + is as just described only when it is followed by two hexadecimal dig- + its. 
Otherwise, it matches a literal "x" character. In JavaScript mode, support for code points greater than 256 is provided by \u, which - must be followed by four hexadecimal digits; otherwise it matches a + must be followed by four hexadecimal digits; otherwise it matches a literal "u" character. Characters whose value is less than 256 can be defined by either of the - two syntaxes for \x (or by \u in JavaScript mode). There is no differ- + two syntaxes for \x (or by \u in JavaScript mode). There is no differ- ence in the way they are handled. For example, \xdc is exactly the same as \x{dc} (or \u00dc in JavaScript mode). Constraints on character values - Characters that are specified using octal or hexadecimal numbers are + Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows: 8-bit non-UTF mode less than 0x100 @@ -5128,44 +5129,44 @@ BACKSLASH 32-bit non-UTF mode less than 0x100000000 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint - Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so- + Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so- called "surrogate" codepoints), and 0xffef. Escape sequences in character classes All the sequences that define a single character value can be used both - inside and outside character classes. In addition, inside a character + inside and outside character classes. In addition, inside a character class, \b is interpreted as the backspace character (hex 08). - \N is not allowed in a character class. \B, \R, and \X are not special - inside a character class. Like other unrecognized escape sequences, - they are treated as the literal characters "B", "R", and "X" by - default, but cause an error if the PCRE_EXTRA option is set. Outside a + \N is not allowed in a character class. \B, \R, and \X are not special + inside a character class. 
Like other unrecognized escape sequences, + they are treated as the literal characters "B", "R", and "X" by + default, but cause an error if the PCRE_EXTRA option is set. Outside a character class, these sequences have different meanings. Unsupported escape sequences - In Perl, the sequences \l, \L, \u, and \U are recognized by its string - handler and used to modify the case of following characters. By - default, PCRE does not support these escape sequences. However, if the - PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and + In Perl, the sequences \l, \L, \u, and \U are recognized by its string + handler and used to modify the case of following characters. By + default, PCRE does not support these escape sequences. However, if the + PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and \u can be used to define a character by code point, as described in the previous section. Absolute and relative back references - The sequence \g followed by an unsigned or a negative number, option- - ally enclosed in braces, is an absolute or relative back reference. A + The sequence \g followed by an unsigned or a negative number, option- + ally enclosed in braces, is an absolute or relative back reference. A named back reference can be coded as \g{name}. Back references are dis- cussed later, following the discussion of parenthesized subpatterns. Absolute and relative subroutine calls - For compatibility with Oniguruma, the non-Perl syntax \g followed by a + For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is - an alternative syntax for referencing a subpattern as a "subroutine". - Details are discussed later. Note that \g{...} (Perl syntax) and - \g<...> (Oniguruma syntax) are not synonymous. The former is a back + an alternative syntax for referencing a subpattern as a "subroutine". + Details are discussed later. 
Note that \g{...} (Perl syntax) and + \g<...> (Oniguruma syntax) are not synonymous. The former is a back reference; the latter is a subroutine call. Generic character types @@ -5184,59 +5185,59 @@ BACKSLASH \W any "non-word" character There is also the single sequence \N, which matches a non-newline char- - acter. This is the same as the "." metacharacter when PCRE_DOTALL is - not set. Perl also uses \N to match characters by name; PCRE does not + acter. This is the same as the "." metacharacter when PCRE_DOTALL is + not set. Perl also uses \N to match characters by name; PCRE does not support this. - Each pair of lower and upper case escape sequences partitions the com- - plete set of characters into two disjoint sets. Any given character - matches one, and only one, of each pair. The sequences can appear both - inside and outside character classes. They each match one character of - the appropriate type. If the current matching point is at the end of - the subject string, all of them fail, because there is no character to + Each pair of lower and upper case escape sequences partitions the com- + plete set of characters into two disjoint sets. Any given character + matches one, and only one, of each pair. The sequences can appear both + inside and outside character classes. They each match one character of + the appropriate type. If the current matching point is at the end of + the subject string, all of them fail, because there is no character to match. - For compatibility with Perl, \s did not used to match the VT character - (code 11), which made it different from the the POSIX "space" class. - However, Perl added VT at release 5.18, and PCRE followed suit at - release 8.34. The default \s characters are now HT (9), LF (10), VT - (11), FF (12), CR (13), and space (32), which are defined as white + For compatibility with Perl, \s did not used to match the VT character + (code 11), which made it different from the the POSIX "space" class. 
+ However, Perl added VT at release 5.18, and PCRE followed suit at + release 8.34. The default \s characters are now HT (9), LF (10), VT + (11), FF (12), CR (13), and space (32), which are defined as white space in the "C" locale. This list may vary if locale-specific matching - is taking place. For example, in some locales the "non-breaking space" - character (\xA0) is recognized as white space, and in others the VT + is taking place. For example, in some locales the "non-breaking space" + character (\xA0) is recognized as white space, and in others the VT character is not. - A "word" character is an underscore or any character that is a letter - or digit. By default, the definition of letters and digits is con- - trolled by PCRE's low-valued character tables, and may vary if locale- - specific matching is taking place (see "Locale support" in the pcreapi - page). For example, in a French locale such as "fr_FR" in Unix-like - systems, or "french" in Windows, some character codes greater than 127 - are used for accented letters, and these are then matched by \w. The + A "word" character is an underscore or any character that is a letter + or digit. By default, the definition of letters and digits is con- + trolled by PCRE's low-valued character tables, and may vary if locale- + specific matching is taking place (see "Locale support" in the pcreapi + page). For example, in a French locale such as "fr_FR" in Unix-like + systems, or "french" in Windows, some character codes greater than 127 + are used for accented letters, and these are then matched by \w. The use of locales with Unicode is discouraged. - By default, characters whose code points are greater than 127 never + By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, although this may - vary for characters in the range 128-255 when locale-specific matching - is happening. 
These escape sequences retain their original meanings - from before Unicode support was available, mainly for efficiency rea- - sons. If PCRE is compiled with Unicode property support, and the - PCRE_UCP option is set, the behaviour is changed so that Unicode prop- + vary for characters in the range 128-255 when locale-specific matching + is happening. These escape sequences retain their original meanings + from before Unicode support was available, mainly for efficiency rea- + sons. If PCRE is compiled with Unicode property support, and the + PCRE_UCP option is set, the behaviour is changed so that Unicode prop- erties are used to determine character types, as follows: \d any character that matches \p{Nd} (decimal digit) \s any character that matches \p{Z} or \h or \v \w any character that matches \p{L} or \p{N}, plus underscore - The upper case escapes match the inverse sets of characters. Note that - \d matches only decimal digits, whereas \w matches any Unicode digit, - as well as any Unicode letter, and underscore. Note also that PCRE_UCP - affects \b, and \B because they are defined in terms of \w and \W. + The upper case escapes match the inverse sets of characters. Note that + \d matches only decimal digits, whereas \w matches any Unicode digit, + as well as any Unicode letter, and underscore. Note also that PCRE_UCP + affects \b, and \B because they are defined in terms of \w and \W. Matching these sequences is noticeably slower when PCRE_UCP is set. - The sequences \h, \H, \v, and \V are features that were added to Perl - at release 5.10. In contrast to the other sequences, which match only - ASCII characters by default, these always match certain high-valued + The sequences \h, \H, \v, and \V are features that were added to Perl + at release 5.10. In contrast to the other sequences, which match only + ASCII characters by default, these always match certain high-valued code points, whether or not PCRE_UCP is set. 
The horizontal space char- acters are: @@ -5275,110 +5276,110 @@ BACKSLASH Newline sequences - Outside a character class, by default, the escape sequence \R matches - any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent + Outside a character class, by default, the escape sequence \R matches + any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the following: (?>\r\n|\n|\x0b|\f|\r|\x85) - This is an example of an "atomic group", details of which are given + This is an example of an "atomic group", details of which are given below. This particular group matches either the two-character sequence - CR followed by LF, or one of the single characters LF (linefeed, - U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- - riage return, U+000D), or NEL (next line, U+0085). The two-character + CR followed by LF, or one of the single characters LF (linefeed, + U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- + riage return, U+000D), or NEL (next line, U+0085). The two-character sequence is treated as a single unit that cannot be split. - In other modes, two additional characters whose codepoints are greater + In other modes, two additional characters whose codepoints are greater than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- - rator, U+2029). Unicode character property support is not needed for + rator, U+2029). Unicode character property support is not needed for these characters to be recognized. It is possible to restrict \R to match only CR, LF, or CRLF (instead of - the complete set of Unicode line endings) by setting the option + the complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. (BSR is an abbrevation for "backslash R".) This can be made the default - when PCRE is built; if this is the case, the other behaviour can be - requested via the PCRE_BSR_UNICODE option. 
It is also possible to - specify these settings by starting a pattern string with one of the + when PCRE is built; if this is the case, the other behaviour can be + requested via the PCRE_BSR_UNICODE option. It is also possible to + specify these settings by starting a pattern string with one of the following sequences: (*BSR_ANYCRLF) CR, LF, or CRLF only (*BSR_UNICODE) any Unicode newline sequence These override the default and the options given to the compiling func- - tion, but they can themselves be overridden by options given to a - matching function. Note that these special settings, which are not - Perl-compatible, are recognized only at the very start of a pattern, - and that they must be in upper case. If more than one of them is - present, the last one is used. They can be combined with a change of + tion, but they can themselves be overridden by options given to a + matching function. Note that these special settings, which are not + Perl-compatible, are recognized only at the very start of a pattern, + and that they must be in upper case. If more than one of them is + present, the last one is used. They can be combined with a change of newline convention; for example, a pattern can start with: (*ANY)(*BSR_ANYCRLF) - They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) + They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or (*UCP) special sequences. Inside a character class, \R is treated as - an unrecognized escape sequence, and so matches the letter "R" by + an unrecognized escape sequence, and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is set. Unicode character properties When PCRE is built with Unicode character property support, three addi- - tional escape sequences that match characters with specific properties - are available. 
When in 8-bit non-UTF-8 mode, these sequences are of - course limited to testing characters whose codepoints are less than + tional escape sequences that match characters with specific properties + are available. When in 8-bit non-UTF-8 mode, these sequences are of + course limited to testing characters whose codepoints are less than 256, but they do work in this mode. The extra escape sequences are: \p{xx} a character with the xx property \P{xx} a character without the xx property \X a Unicode extended grapheme cluster - The property names represented by xx above are limited to the Unicode + The property names represented by xx above are limited to the Unicode script names, the general category properties, "Any", which matches any - character (including newline), and some special PCRE properties - (described in the next section). Other Perl properties such as "InMu- - sicalSymbols" are not currently supported by PCRE. Note that \P{Any} + character (including newline), and some special PCRE properties + (described in the next section). Other Perl properties such as "InMu- + sicalSymbols" are not currently supported by PCRE. Note that \P{Any} does not match any characters, so always causes a match failure. Sets of Unicode characters are defined as belonging to certain scripts. - A character from one of these sets can be matched using a script name. + A character from one of these sets can be matched using a script name. For example: \p{Greek} \P{Han} - Those that are not part of an identified script are lumped together as + Those that are not part of an identified script are lumped together as "Common". 
The current list of scripts is: - Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, - Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car- + Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, + Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car- ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei- form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero- glyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, - Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, - Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- - tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, - Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin- - ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, - Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, - Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, - New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, + Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, + Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- + tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, + Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin- + ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, + Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, + Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, + New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, - Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha- - vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, - Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, - Thaana, Thai, Tibetan, Tifinagh, 
Tirhuta, Ugaritic, Vai, Warang_Citi, + Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha- + vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, + Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, + Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi. Each character has exactly one Unicode general category property, spec- - ified by a two-letter abbreviation. For compatibility with Perl, nega- - tion can be specified by including a circumflex between the opening - brace and the property name. For example, \p{^Lu} is the same as + ified by a two-letter abbreviation. For compatibility with Perl, nega- + tion can be specified by including a circumflex between the opening + brace and the property name. For example, \p{^Lu} is the same as \P{Lu}. If only one letter is specified with \p or \P, it includes all the gen- - eral category properties that start with that letter. In this case, in - the absence of negation, the curly brackets in the escape sequence are + eral category properties that start with that letter. In this case, in + the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect: \p{L} @@ -5430,73 +5431,73 @@ BACKSLASH Zp Paragraph separator Zs Space separator - The special property L& is also supported: it matches a character that - has the Lu, Ll, or Lt property, in other words, a letter that is not + The special property L& is also supported: it matches a character that + has the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other". - The Cs (Surrogate) property applies only to characters in the range - U+D800 to U+DFFF. Such characters are not valid in Unicode strings and - so cannot be tested by PCRE, unless UTF validity checking has been + The Cs (Surrogate) property applies only to characters in the range + U+D800 to U+DFFF. 
Such characters are not valid in Unicode strings and + so cannot be tested by PCRE, unless UTF validity checking has been turned off (see the discussion of PCRE_NO_UTF8_CHECK, - PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl + PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl does not support the Cs property. - The long synonyms for property names that Perl supports (such as - \p{Letter}) are not supported by PCRE, nor is it permitted to prefix + The long synonyms for property names that Perl supports (such as + \p{Letter}) are not supported by PCRE, nor is it permitted to prefix any of these properties with "Is". No character that is in the Unicode table has the Cn (unassigned) prop- erty. Instead, this property is assumed for any code point that is not in the Unicode table. - Specifying caseless matching does not affect these escape sequences. - For example, \p{Lu} always matches only upper case letters. This is + Specifying caseless matching does not affect these escape sequences. + For example, \p{Lu} always matches only upper case letters. This is different from the behaviour of current versions of Perl. - Matching characters by Unicode property is not fast, because PCRE has - to do a multistage table lookup in order to find a character's prop- + Matching characters by Unicode property is not fast, because PCRE has + to do a multistage table lookup in order to find a character's prop- erty. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE by default, though you can make them - do so by setting the PCRE_UCP option or by starting the pattern with + do so by setting the PCRE_UCP option or by starting the pattern with (*UCP). 
Extended grapheme clusters - The \X escape matches any number of Unicode characters that form an + The \X escape matches any number of Unicode characters that form an "extended grapheme cluster", and treats the sequence as an atomic group - (see below). Up to and including release 8.31, PCRE matched an ear- + (see below). Up to and including release 8.31, PCRE matched an ear- lier, simpler definition that was equivalent to (?>\PM\pM*) - That is, it matched a character without the "mark" property, followed - by zero or more characters with the "mark" property. Characters with - the "mark" property are typically non-spacing accents that affect the + That is, it matched a character without the "mark" property, followed + by zero or more characters with the "mark" property. Characters with + the "mark" property are typically non-spacing accents that affect the preceding character. - This simple definition was extended in Unicode to include more compli- - cated kinds of composite character by giving each character a grapheme - breaking property, and creating rules that use these properties to - define the boundaries of extended grapheme clusters. In releases of + This simple definition was extended in Unicode to include more compli- + cated kinds of composite character by giving each character a grapheme + breaking property, and creating rules that use these properties to + define the boundaries of extended grapheme clusters. In releases of PCRE later than 8.31, \X matches one of these clusters. - \X always matches at least one character. Then it decides whether to + \X always matches at least one character. Then it decides whether to add additional characters according to the following rules for ending a cluster: 1. End at the end of the subject string. - 2. Do not end between CR and LF; otherwise end after any control char- + 2. Do not end between CR and LF; otherwise end after any control char- acter. - 3. Do not break Hangul (a Korean script) syllable sequences. 
Hangul - characters are of five types: L, V, T, LV, and LVT. An L character may - be followed by an L, V, LV, or LVT character; an LV or V character may + 3. Do not break Hangul (a Korean script) syllable sequences. Hangul + characters are of five types: L, V, T, LV, and LVT. An L character may + be followed by an L, V, LV, or LVT character; an LV or V character may be followed by a V or T character; an LVT or T character may be follwed only by a T character. - 4. Do not end before extending characters or spacing marks. Characters - with the "mark" property always have the "extend" grapheme breaking + 4. Do not end before extending characters or spacing marks. Characters + with the "mark" property always have the "extend" grapheme breaking property. 5. Do not end after prepend characters. @@ -5505,9 +5506,9 @@ BACKSLASH PCRE's additional properties - As well as the standard Unicode properties described above, PCRE sup- - ports four more that make it possible to convert traditional escape - sequences such as \w and \s to use Unicode properties. PCRE uses these + As well as the standard Unicode properties described above, PCRE sup- + ports four more that make it possible to convert traditional escape + sequences such as \w and \s to use Unicode properties. PCRE uses these non-standard, non-Perl properties internally when PCRE_UCP is set. How- ever, they may also be used explicitly. These properties are: @@ -5516,54 +5517,54 @@ BACKSLASH Xsp Any Perl space character Xwd Any Perl "word" character - Xan matches characters that have either the L (letter) or the N (num- - ber) property. Xps matches the characters tab, linefeed, vertical tab, - form feed, or carriage return, and any other character that has the Z - (separator) property. Xsp is the same as Xps; it used to exclude ver- - tical tab, for Perl compatibility, but Perl changed, and so PCRE fol- - lowed at release 8.34. 
Xwd matches the same characters as Xan, plus + Xan matches characters that have either the L (letter) or the N (num- + ber) property. Xps matches the characters tab, linefeed, vertical tab, + form feed, or carriage return, and any other character that has the Z + (separator) property. Xsp is the same as Xps; it used to exclude ver- + tical tab, for Perl compatibility, but Perl changed, and so PCRE fol- + lowed at release 8.34. Xwd matches the same characters as Xan, plus underscore. - There is another non-standard property, Xuc, which matches any charac- - ter that can be represented by a Universal Character Name in C++ and - other programming languages. These are the characters $, @, ` (grave - accent), and all characters with Unicode code points greater than or - equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that - most base (ASCII) characters are excluded. (Universal Character Names - are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. + There is another non-standard property, Xuc, which matches any charac- + ter that can be represented by a Universal Character Name in C++ and + other programming languages. These are the characters $, @, ` (grave + accent), and all characters with Unicode code points greater than or + equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that + most base (ASCII) characters are excluded. (Universal Character Names + are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. Note that the Xuc property does not match these sequences but the char- acters that they represent.) Resetting the match start - The escape sequence \K causes any previously matched characters not to + The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern: foo\Kbar - matches "foobar", but reports that it has matched "bar". This feature - is similar to a lookbehind assertion (described below). 
However, in - this case, the part of the subject before the real match does not have - to be of fixed length, as lookbehind assertions do. The use of \K does - not interfere with the setting of captured substrings. For example, + matches "foobar", but reports that it has matched "bar". This feature + is similar to a lookbehind assertion (described below). However, in + this case, the part of the subject before the real match does not have + to be of fixed length, as lookbehind assertions do. The use of \K does + not interfere with the setting of captured substrings. For example, when the pattern (foo)\Kbar matches "foobar", the first substring is still set to "foo". - Perl documents that the use of \K within assertions is "not well - defined". In PCRE, \K is acted upon when it occurs inside positive - assertions, but is ignored in negative assertions. Note that when a - pattern such as (?=ab\K) matches, the reported start of the match can + Perl documents that the use of \K within assertions is "not well + defined". In PCRE, \K is acted upon when it occurs inside positive + assertions, but is ignored in negative assertions. Note that when a + pattern such as (?=ab\K) matches, the reported start of the match can be greater than the end of the match. Simple assertions - The final use of backslash is for certain simple assertions. An asser- - tion specifies a condition that has to be met at a particular point in - a match, without consuming any characters from the subject string. The - use of subpatterns for more complicated assertions is described below. + The final use of backslash is for certain simple assertions. An asser- + tion specifies a condition that has to be met at a particular point in + a match, without consuming any characters from the subject string. The + use of subpatterns for more complicated assertions is described below. 
The backslashed assertions are: \b matches at a word boundary @@ -5574,161 +5575,161 @@ BACKSLASH \z matches only at the end of the subject \G matches at the first matching position in the subject - Inside a character class, \b has a different meaning; it matches the - backspace character. If any other of these assertions appears in a - character class, by default it matches the corresponding literal char- + Inside a character class, \b has a different meaning; it matches the + backspace character. If any other of these assertions appears in a + character class, by default it matches the corresponding literal char- acter (for example, \B matches the letter B). However, if the - PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- + PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- ated instead. - A word boundary is a position in the subject string where the current - character and the previous character do not both match \w or \W (i.e. - one matches \w and the other matches \W), or the start or end of the - string if the first or last character matches \w, respectively. In a - UTF mode, the meanings of \w and \W can be changed by setting the - PCRE_UCP option. When this is done, it also affects \b and \B. Neither - PCRE nor Perl has a separate "start of word" or "end of word" metase- - quence. However, whatever follows \b normally determines which it is. + A word boundary is a position in the subject string where the current + character and the previous character do not both match \w or \W (i.e. + one matches \w and the other matches \W), or the start or end of the + string if the first or last character matches \w, respectively. In a + UTF mode, the meanings of \w and \W can be changed by setting the + PCRE_UCP option. When this is done, it also affects \b and \B. Neither + PCRE nor Perl has a separate "start of word" or "end of word" metase- + quence. However, whatever follows \b normally determines which it is. 
For example, the fragment \ba matches "a" at the start of a word. - The \A, \Z, and \z assertions differ from the traditional circumflex + The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described in the next section) in that they only ever match - at the very start and end of the subject string, whatever options are - set. Thus, they are independent of multiline mode. These three asser- + at the very start and end of the subject string, whatever options are + set. Thus, they are independent of multiline mode. These three asser- tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which - affect only the behaviour of the circumflex and dollar metacharacters. - However, if the startoffset argument of pcre_exec() is non-zero, indi- + affect only the behaviour of the circumflex and dollar metacharacters. + However, if the startoffset argument of pcre_exec() is non-zero, indi- cating that matching is to start at a point other than the beginning of - the subject, \A can never match. The difference between \Z and \z is + the subject, \A can never match. The difference between \Z and \z is that \Z matches before a newline at the end of the string as well as at the very end, whereas \z matches only at the end. - The \G assertion is true only when the current matching position is at - the start point of the match, as specified by the startoffset argument - of pcre_exec(). It differs from \A when the value of startoffset is - non-zero. By calling pcre_exec() multiple times with appropriate argu- + The \G assertion is true only when the current matching position is at + the start point of the match, as specified by the startoffset argument + of pcre_exec(). It differs from \A when the value of startoffset is + non-zero. By calling pcre_exec() multiple times with appropriate argu- ments, you can mimic Perl's /g option, and it is in this kind of imple- mentation where \G can be useful. 
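As a rough illustration of the repeated-call idea described above, the following Python sketch uses the standard `re` module as a stand-in for pcre_exec(). This is an assumption made for the sake of a runnable example: Python's `re` does not support \G, so `Pattern.match(subject, pos)`, which succeeds only at exactly `pos`, plays the role of a pattern that begins with \G combined with a non-zero startoffset. The helper name `scan_tokens` and the token pattern are hypothetical.

```python
import re


def scan_tokens(subject):
    """Collect consecutive tokens, each match starting where the last one ended."""
    # Hypothetical token pattern: a run of digits or a run of lower case letters.
    token = re.compile(r"\d+|[a-z]+")
    tokens = []
    offset = 0  # plays the role of the startoffset argument of pcre_exec()
    while offset < len(subject):
        # match() succeeds only at exactly `offset`, much like a PCRE
        # pattern whose alternatives all begin with \G.
        m = token.match(subject, offset)
        if m is None:
            break  # no token starts here; stop rather than skip ahead
        tokens.append(m.group())
        offset = m.end()  # the next attempt starts where this match ended
    return tokens, offset
```

Because neither alternative can match the empty string, the offset always advances, so the loop sidesteps the empty-match case in which PCRE's and Perl's interpretations of \G differ.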
- Note, however, that PCRE's interpretation of \G, as the start of the + Note, however, that PCRE's interpretation of \G, as the start of the current match, is subtly different from Perl's, which defines it as the - end of the previous match. In Perl, these can be different when the - previously matched string was empty. Because PCRE does just one match + end of the previous match. In Perl, these can be different when the + previously matched string was empty. Because PCRE does just one match at a time, it cannot reproduce this behaviour. - If all the alternatives of a pattern begin with \G, the expression is + If all the alternatives of a pattern begin with \G, the expression is anchored to the starting match position, and the "anchored" flag is set in the compiled regular expression. CIRCUMFLEX AND DOLLAR - The circumflex and dollar metacharacters are zero-width assertions. - That is, they test for a particular condition being true without con- + The circumflex and dollar metacharacters are zero-width assertions. + That is, they test for a particular condition being true without con- suming any characters from the subject string. Outside a character class, in the default matching mode, the circumflex - character is an assertion that is true only if the current matching - point is at the start of the subject string. If the startoffset argu- - ment of pcre_exec() is non-zero, circumflex can never match if the - PCRE_MULTILINE option is unset. Inside a character class, circumflex + character is an assertion that is true only if the current matching + point is at the start of the subject string. If the startoffset argu- + ment of pcre_exec() is non-zero, circumflex can never match if the + PCRE_MULTILINE option is unset. Inside a character class, circumflex has an entirely different meaning (see below). 
- Circumflex need not be the first character of the pattern if a number - of alternatives are involved, but it should be the first thing in each - alternative in which it appears if the pattern is ever to match that - branch. If all possible alternatives start with a circumflex, that is, - if the pattern is constrained to match only at the start of the sub- - ject, it is said to be an "anchored" pattern. (There are also other + Circumflex need not be the first character of the pattern if a number + of alternatives are involved, but it should be the first thing in each + alternative in which it appears if the pattern is ever to match that + branch. If all possible alternatives start with a circumflex, that is, + if the pattern is constrained to match only at the start of the sub- + ject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.) - The dollar character is an assertion that is true only if the current - matching point is at the end of the subject string, or immediately - before a newline at the end of the string (by default). Note, however, - that it does not actually match the newline. Dollar need not be the + The dollar character is an assertion that is true only if the current + matching point is at the end of the subject string, or immediately + before a newline at the end of the string (by default). Note, however, + that it does not actually match the newline. Dollar need not be the last character of the pattern if a number of alternatives are involved, - but it should be the last item in any branch in which it appears. Dol- + but it should be the last item in any branch in which it appears. Dol- lar has no special meaning in a character class. 
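The default behaviour of circumflex and dollar described above can be sketched in Python. This uses the standard `re` module rather than PCRE itself, on the assumption that the two agree for these particular cases; the `pos` argument of `Pattern.search()` stands in for a non-zero startoffset, and the function name `anchor_demo` is hypothetical.

```python
import re


def anchor_demo():
    """Return a few default-mode (non-multiline) anchor results."""
    results = {}
    # Dollar is true at the very end of the subject...
    results["dollar_at_end"] = re.search(r"abc$", "abc").span()
    # ...and immediately before a newline that ends the subject.
    # The newline itself is not part of the match.
    results["dollar_before_final_newline"] = re.search(r"abc$", "abc\n").span()
    # A newline in the middle of the subject does not satisfy dollar by default.
    results["dollar_internal_newline"] = re.search(r"abc$", "abc\ndef")
    # Circumflex refers to the real start of the subject, so a search that
    # begins at offset 1 cannot satisfy it in the default mode.
    results["circumflex_with_offset"] = re.compile(r"^b").search("ab", 1)
    return results
```

In both dollar cases that match, the reported span ends at offset 3, which shows that the trailing newline is asserted but not consumed.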
- The meaning of dollar can be changed so that it matches only at the - very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at + The meaning of dollar can be changed so that it matches only at the + very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This does not affect the \Z assertion. The meanings of the circumflex and dollar characters are changed if the - PCRE_MULTILINE option is set. When this is the case, a circumflex - matches immediately after internal newlines as well as at the start of - the subject string. It does not match after a newline that ends the - string. A dollar matches before any newlines in the string, as well as - at the very end, when PCRE_MULTILINE is set. When newline is specified - as the two-character sequence CRLF, isolated CR and LF characters do + PCRE_MULTILINE option is set. When this is the case, a circumflex + matches immediately after internal newlines as well as at the start of + the subject string. It does not match after a newline that ends the + string. A dollar matches before any newlines in the string, as well as + at the very end, when PCRE_MULTILINE is set. When newline is specified + as the two-character sequence CRLF, isolated CR and LF characters do not indicate newlines. - For example, the pattern /^abc$/ matches the subject string "def\nabc" - (where \n represents a newline) in multiline mode, but not otherwise. - Consequently, patterns that are anchored in single line mode because - all branches start with ^ are not anchored in multiline mode, and a - match for circumflex is possible when the startoffset argument of - pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if + For example, the pattern /^abc$/ matches the subject string "def\nabc" + (where \n represents a newline) in multiline mode, but not otherwise. 
+ Consequently, patterns that are anchored in single line mode because + all branches start with ^ are not anchored in multiline mode, and a + match for circumflex is possible when the startoffset argument of + pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. - Note that the sequences \A, \Z, and \z can be used to match the start - and end of the subject in both modes, and if all branches of a pattern - start with \A it is always anchored, whether or not PCRE_MULTILINE is + Note that the sequences \A, \Z, and \z can be used to match the start + and end of the subject in both modes, and if all branches of a pattern + start with \A it is always anchored, whether or not PCRE_MULTILINE is set. FULL STOP (PERIOD, DOT) AND \N Outside a character class, a dot in the pattern matches any one charac- - ter in the subject string except (by default) a character that signi- + ter in the subject string except (by default) a character that signi- fies the end of a line. - When a line ending is defined as a single character, dot never matches - that character; when the two-character sequence CRLF is used, dot does - not match CR if it is immediately followed by LF, but otherwise it - matches all characters (including isolated CRs and LFs). When any Uni- - code line endings are being recognized, dot does not match CR or LF or + When a line ending is defined as a single character, dot never matches + that character; when the two-character sequence CRLF is used, dot does + not match CR if it is immediately followed by LF, but otherwise it + matches all characters (including isolated CRs and LFs). When any Uni- + code line endings are being recognized, dot does not match CR or LF or any of the other line ending characters. - The behaviour of dot with regard to newlines can be changed. If the - PCRE_DOTALL option is set, a dot matches any one character, without + The behaviour of dot with regard to newlines can be changed. 
If the + PCRE_DOTALL option is set, a dot matches any one character, without exception. If the two-character sequence CRLF is present in the subject string, it takes two dots to match it. - The handling of dot is entirely independent of the handling of circum- - flex and dollar, the only relationship being that they both involve + The handling of dot is entirely independent of the handling of circum- + flex and dollar, the only relationship being that they both involve newlines. Dot has no special meaning in a character class. - The escape sequence \N behaves like a dot, except that it is not - affected by the PCRE_DOTALL option. In other words, it matches any - character except one that signifies the end of a line. Perl also uses + The escape sequence \N behaves like a dot, except that it is not + affected by the PCRE_DOTALL option. In other words, it matches any + character except one that signifies the end of a line. Perl also uses \N to match characters by name; PCRE does not support this. MATCHING A SINGLE DATA UNIT - Outside a character class, the escape sequence \C matches any one data - unit, whether or not a UTF mode is set. In the 8-bit library, one data - unit is one byte; in the 16-bit library it is a 16-bit unit; in the - 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches - line-ending characters. The feature is provided in Perl in order to + Outside a character class, the escape sequence \C matches any one data + unit, whether or not a UTF mode is set. In the 8-bit library, one data + unit is one byte; in the 16-bit library it is a 16-bit unit; in the + 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches + line-ending characters. The feature is provided in Perl in order to match individual bytes in UTF-8 mode, but it is unclear how it can use- - fully be used. Because \C breaks up characters into individual data - units, matching one unit with \C in a UTF mode means that the rest of + fully be used. 
Because \C breaks up characters into individual data + units, matching one unit with \C in a UTF mode means that the rest of the string may start with a malformed UTF character. This has undefined results, because PCRE assumes that it is dealing with valid UTF strings - (and by default it checks this at the start of processing unless the - PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option + (and by default it checks this at the start of processing unless the + PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option is used). - PCRE does not allow \C to appear in lookbehind assertions (described - below) in a UTF mode, because this would make it impossible to calcu- + PCRE does not allow \C to appear in lookbehind assertions (described + below) in a UTF mode, because this would make it impossible to calcu- late the length of the lookbehind. In general, the \C escape sequence is best avoided. However, one way of - using it that avoids the problem of malformed UTF characters is to use - a lookahead to check the length of the next character, as in this pat- - tern, which could be used with a UTF-8 string (ignore white space and + using it that avoids the problem of malformed UTF characters is to use + a lookahead to check the length of the next character, as in this pat- + tern, which could be used with a UTF-8 string (ignore white space and line breaks): (?| (?=[\x00-\x7f])(\C) | @@ -5736,11 +5737,11 @@ MATCHING A SINGLE DATA UNIT (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) - A group that starts with (?| resets the capturing parentheses numbers - in each alternative (see "Duplicate Subpattern Numbers" below). The - assertions at the start of each branch check the next UTF-8 character - for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. 
The - character's individual bytes are then captured by the appropriate num- + A group that starts with (?| resets the capturing parentheses numbers + in each alternative (see "Duplicate Subpattern Numbers" below). The + assertions at the start of each branch check the next UTF-8 character + for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The + character's individual bytes are then captured by the appropriate num- ber of groups. @@ -5750,109 +5751,109 @@ SQUARE BRACKETS AND CHARACTER CLASSES closing square bracket. A closing square bracket on its own is not spe- cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square bracket causes a compile-time error. If a closing - square bracket is required as a member of the class, it should be the - first data character in the class (after an initial circumflex, if + square bracket is required as a member of the class, it should be the + first data character in the class (after an initial circumflex, if present) or escaped with a backslash. - A character class matches a single character in the subject. In a UTF - mode, the character may be more than one data unit long. A matched + A character class matches a single character in the subject. In a UTF + mode, the character may be more than one data unit long. A matched character must be in the set of characters defined by the class, unless - the first character in the class definition is a circumflex, in which + the first character in the class definition is a circumflex, in which case the subject character must not be in the set defined by the class. - If a circumflex is actually required as a member of the class, ensure + If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash. - For example, the character class [aeiou] matches any lower case vowel, - while [^aeiou] matches any character that is not a lower case vowel. 
+ For example, the character class [aeiou] matches any lower case vowel, + while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the - characters that are in the class by enumerating those that are not. A - class that starts with a circumflex is not an assertion; it still con- - sumes a character from the subject string, and therefore it fails if + characters that are in the class by enumerating those that are not. A + class that starts with a circumflex is not an assertion; it still con- + sumes a character from the subject string, and therefore it fails if the current pointer is at the end of the string. In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 - (0xffff) can be included in a class as a literal string of data units, + (0xffff) can be included in a class as a literal string of data units, or by using the \x{ escaping mechanism. - When caseless matching is set, any letters in a class represent both - their upper case and lower case versions, so for example, a caseless - [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not - match "A", whereas a caseful version would. In a UTF mode, PCRE always - understands the concept of case for characters whose values are less - than 128, so caseless matching is always possible. For characters with - higher values, the concept of case is supported if PCRE is compiled - with Unicode property support, but not otherwise. If you want to use - caseless matching in a UTF mode for characters 128 and above, you must - ensure that PCRE is compiled with Unicode property support as well as + When caseless matching is set, any letters in a class represent both + their upper case and lower case versions, so for example, a caseless + [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not + match "A", whereas a caseful version would. 
In a UTF mode, PCRE always + understands the concept of case for characters whose values are less + than 128, so caseless matching is always possible. For characters with + higher values, the concept of case is supported if PCRE is compiled + with Unicode property support, but not otherwise. If you want to use + caseless matching in a UTF mode for characters 128 and above, you must + ensure that PCRE is compiled with Unicode property support as well as with UTF support. - Characters that might indicate line breaks are never treated in any - special way when matching character classes, whatever line-ending - sequence is in use, and whatever setting of the PCRE_DOTALL and + Characters that might indicate line breaks are never treated in any + special way when matching character classes, whatever line-ending + sequence is in use, and whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is used. A class such as [^a] always matches one of these characters. - The minus (hyphen) character can be used to specify a range of charac- - ters in a character class. For example, [d-m] matches any letter - between d and m, inclusive. If a minus character is required in a - class, it must be escaped with a backslash or appear in a position - where it cannot be interpreted as indicating a range, typically as the + The minus (hyphen) character can be used to specify a range of charac- + ters in a character class. For example, [d-m] matches any letter + between d and m, inclusive. If a minus character is required in a + class, it must be escaped with a backslash or appear in a position + where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For - example, [b-d-z] matches letters in the range b to d, a hyphen charac- + example, [b-d-z] matches letters in the range b to d, a hyphen charac- ter, or z. It is not possible to have the literal character "]" as the end charac- - ter of a range. 
A pattern such as [W-]46] is interpreted as a class of - two characters ("W" and "-") followed by a literal string "46]", so it - would match "W46]" or "-46]". However, if the "]" is escaped with a - backslash it is interpreted as the end of range, so [W-\]46] is inter- - preted as a class containing a range followed by two other characters. - The octal or hexadecimal representation of "]" can also be used to end + ter of a range. A pattern such as [W-]46] is interpreted as a class of + two characters ("W" and "-") followed by a literal string "46]", so it + would match "W46]" or "-46]". However, if the "]" is escaped with a + backslash it is interpreted as the end of range, so [W-\]46] is inter- + preted as a class containing a range followed by two other characters. + The octal or hexadecimal representation of "]" can also be used to end a range. - An error is generated if a POSIX character class (see below) or an - escape sequence other than one that defines a single character appears - at a point where a range ending character is expected. For example, + An error is generated if a POSIX character class (see below) or an + escape sequence other than one that defines a single character appears + at a point where a range ending character is expected. For example, [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not. - Ranges operate in the collating sequence of character values. They can - also be used for characters specified numerically, for example - [\000-\037]. Ranges can include any characters that are valid for the + Ranges operate in the collating sequence of character values. They can + also be used for characters specified numerically, for example + [\000-\037]. Ranges can include any characters that are valid for the current mode. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. 
For example, [W-c] is equivalent - to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if - character tables for a French locale are in use, [\xc8-\xcb] matches - accented E characters in both cases. In UTF modes, PCRE supports the - concept of case for characters with values greater than 128 only when + to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if + character tables for a French locale are in use, [\xc8-\xcb] matches + accented E characters in both cases. In UTF modes, PCRE supports the + concept of case for characters with values greater than 128 only when it is compiled with Unicode property support. - The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, + The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and \W may appear in a character class, and add the characters that - they match to the class. For example, [\dABCDEF] matches any hexadeci- - mal digit. In UTF modes, the PCRE_UCP option affects the meanings of - \d, \s, \w and their upper case partners, just as it does when they - appear outside a character class, as described in the section entitled + they match to the class. For example, [\dABCDEF] matches any hexadeci- + mal digit. In UTF modes, the PCRE_UCP option affects the meanings of + \d, \s, \w and their upper case partners, just as it does when they + appear outside a character class, as described in the section entitled "Generic character types" above. The escape sequence \b has a different - meaning inside a character class; it matches the backspace character. - The sequences \B, \N, \R, and \X are not special inside a character - class. Like any other unrecognized escape sequences, they are treated - as the literal characters "B", "N", "R", and "X" by default, but cause + meaning inside a character class; it matches the backspace character. + The sequences \B, \N, \R, and \X are not special inside a character + class. 
Like any other unrecognized escape sequences, they are treated + as the literal characters "B", "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is set. - A circumflex can conveniently be used with the upper case character - types to specify a more restricted set of characters than the matching - lower case type. For example, the class [^\W_] matches any letter or + A circumflex can conveniently be used with the upper case character + types to specify a more restricted set of characters than the matching + lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore, whereas [\w] includes underscore. A positive character class should be read as "something OR something OR ..." and a negative class as "NOT something AND NOT something AND NOT ...". - The only metacharacters that are recognized in character classes are - backslash, hyphen (only where it can be interpreted as specifying a - range), circumflex (only at the start), opening square bracket (only - when it can be interpreted as introducing a POSIX class name, or for a - special compatibility feature - see the next two sections), and the + The only metacharacters that are recognized in character classes are + backslash, hyphen (only where it can be interpreted as specifying a + range), circumflex (only at the start), opening square bracket (only + when it can be interpreted as introducing a POSIX class name, or for a + special compatibility feature - see the next two sections), and the terminating closing square bracket. However, escaping other non- alphanumeric characters does no harm. @@ -5860,7 +5861,7 @@ SQUARE BRACKETS AND CHARACTER CLASSES POSIX CHARACTER CLASSES Perl supports the POSIX notation for character classes. This uses names - enclosed by [: and :] within the enclosing square brackets. PCRE also + enclosed by [: and :] within the enclosing square brackets. PCRE also supports this notation. 
For example, [01[:alpha:]%] @@ -5883,28 +5884,28 @@ POSIX CHARACTER CLASSES word "word" characters (same as \w) xdigit hexadecimal digits - The default "space" characters are HT (9), LF (10), VT (11), FF (12), - CR (13), and space (32). If locale-specific matching is taking place, - the list of space characters may be different; there may be fewer or + The default "space" characters are HT (9), LF (10), VT (11), FF (12), + CR (13), and space (32). If locale-specific matching is taking place, + the list of space characters may be different; there may be fewer or more of them. "Space" used to be different to \s, which did not include VT, for Perl compatibility. However, Perl changed at release 5.18, and - PCRE followed at release 8.34. "Space" and \s now match the same set + PCRE followed at release 8.34. "Space" and \s now match the same set of characters. - The name "word" is a Perl extension, and "blank" is a GNU extension - from Perl 5.8. Another Perl extension is negation, which is indicated + The name "word" is a Perl extension, and "blank" is a GNU extension + from Perl 5.8. Another Perl extension is negation, which is indicated by a ^ character after the colon. For example, [12[:^digit:]] - matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the + matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered. By default, characters with values greater than 128 do not match any of - the POSIX character classes. However, if the PCRE_UCP option is passed - to pcre_compile(), some of the classes are changed so that Unicode - character properties are used. This is achieved by replacing certain + the POSIX character classes. However, if the PCRE_UCP option is passed + to pcre_compile(), some of the classes are changed so that Unicode + character properties are used. 
This is achieved by replacing certain POSIX classes by other sequences, as follows: [:alnum:] becomes \p{Xan} @@ -5916,10 +5917,10 @@ POSIX CHARACTER CLASSES [:upper:] becomes \p{Lu} [:word:] becomes \p{Xwd} - Negated versions, such as [:^alpha:] use \P instead of \p. Three other + Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX classes are handled specially in UCP mode: - [:graph:] This matches characters that have glyphs that mark the page + [:graph:] This matches characters that have glyphs that mark the page when printed. In Unicode property terms, it matches all char- acters with the L, M, N, P, S, or Cf properties, except for: @@ -5928,58 +5929,58 @@ POSIX CHARACTER CLASSES U+2066 - U+2069 Various "isolate"s - [:print:] This matches the same characters as [:graph:] plus space - characters that are not controls, that is, characters with + [:print:] This matches the same characters as [:graph:] plus space + characters that are not controls, that is, characters with the Zs property. [:punct:] This matches all characters that have the Unicode P (punctua- - tion) property, plus those characters whose code points are + tion) property, plus those characters whose code points are less than 128 that have the S (Symbol) property. - The other POSIX classes are unchanged, and match only characters with + The other POSIX classes are unchanged, and match only characters with code points less than 128. COMPATIBILITY FEATURE FOR WORD BOUNDARIES - In the POSIX.2 compliant library that was included in 4.4BSD Unix, the - ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" + In the POSIX.2 compliant library that was included in 4.4BSD Unix, the + ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of word". PCRE treats these items as follows: [[:<:]] is converted to \b(?=\w) [[:>:]] is converted to \b(?<=\w) Only these exact character sequences are recognized. 
A sequence such as - [a[:<:]b] provokes error for an unrecognized POSIX class name. This - support is not compatible with Perl. It is provided to help migrations + [a[:<:]b] provokes error for an unrecognized POSIX class name. This + support is not compatible with Perl. It is provided to help migrations from other environments, and is best not used in any new patterns. Note - that \b matches at the start and the end of a word (see "Simple asser- - tions" above), and in a Perl-style pattern the preceding or following - character normally shows which is wanted, without the need for the - assertions that are used above in order to give exactly the POSIX be- + that \b matches at the start and the end of a word (see "Simple asser- + tions" above), and in a Perl-style pattern the preceding or following + character normally shows which is wanted, without the need for the + assertions that are used above in order to give exactly the POSIX be- haviour. VERTICAL BAR - Vertical bar characters are used to separate alternative patterns. For + Vertical bar characters are used to separate alternative patterns. For example, the pattern gilbert|sullivan - matches either "gilbert" or "sullivan". Any number of alternatives may - appear, and an empty alternative is permitted (matching the empty + matches either "gilbert" or "sullivan". Any number of alternatives may + appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left - to right, and the first one that succeeds is used. If the alternatives - are within a subpattern (defined below), "succeeds" means matching the + to right, and the first one that succeeds is used. If the alternatives + are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern. 
INTERNAL OPTION SETTING - The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and - PCRE_EXTENDED options (which are Perl-compatible) can be changed from - within the pattern by a sequence of Perl option letters enclosed + The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and + PCRE_EXTENDED options (which are Perl-compatible) can be changed from + within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are i for PCRE_CASELESS @@ -5989,51 +5990,47 @@ INTERNAL OPTION SETTING For example, (?im) sets caseless, multiline matching. It is also possi- ble to unset these options by preceding the letter with a hyphen, and a - combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- - LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, - is also permitted. If a letter appears both before and after the + combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- + LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, + is also permitted. If a letter appears both before and after the hyphen, the option is unset. - The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA - can be changed in the same way as the Perl-compatible options by using + The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA + can be changed in the same way as the Perl-compatible options by using the characters J, U and X respectively. - When one of these option changes occurs at top level (that is, not - inside subpattern parentheses), the change applies to the remainder of - the pattern that follows. If the change is placed right at the start of - a pattern, PCRE extracts it into the global options (and it will there- - fore show up in data extracted by the pcre_fullinfo() function). 
- - An option change within a subpattern (see below for a description of - subpatterns) affects only that part of the subpattern that follows it, - so + When one of these option changes occurs at top level (that is, not + inside subpattern parentheses), the change applies to the remainder of + the pattern that follows. An option change within a subpattern (see + below for a description of subpatterns) affects only that part of the + subpattern that follows it, so (a(?i)b)c matches abc and aBc and no other strings (assuming PCRE_CASELESS is not - used). By this means, options can be made to have different settings - in different parts of the pattern. Any changes made in one alternative - do carry on into subsequent branches within the same subpattern. For + used). By this means, options can be made to have different settings + in different parts of the pattern. Any changes made in one alternative + do carry on into subsequent branches within the same subpattern. For example, (a(?i)b|c) - matches "ab", "aB", "c", and "C", even though when matching "C" the - first branch is abandoned before the option setting. This is because - the effects of option settings happen at compile time. There would be + matches "ab", "aB", "c", and "C", even though when matching "C" the + first branch is abandoned before the option setting. This is because + the effects of option settings happen at compile time. There would be some very weird behaviour otherwise. - Note: There are other PCRE-specific options that can be set by the - application when the compiling or matching functions are called. In - some cases the pattern can contain special leading sequences such as - (*CRLF) to override what the application has set or what has been - defaulted. Details are given in the section entitled "Newline - sequences" above. 
There are also the (*UTF8), (*UTF16),(*UTF32), and - (*UCP) leading sequences that can be used to set UTF and Unicode prop- - erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16, - PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence - is a generic version that can be used with any of the libraries. How- - ever, the application can set the PCRE_NEVER_UTF option, which locks + Note: There are other PCRE-specific options that can be set by the + application when the compiling or matching functions are called. In + some cases the pattern can contain special leading sequences such as + (*CRLF) to override what the application has set or what has been + defaulted. Details are given in the section entitled "Newline + sequences" above. There are also the (*UTF8), (*UTF16),(*UTF32), and + (*UCP) leading sequences that can be used to set UTF and Unicode prop- + erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16, + PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence + is a generic version that can be used with any of the libraries. How- + ever, the application can set the PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences. @@ -6046,18 +6043,18 @@ SUBPATTERNS cat(aract|erpillar|) - matches "cataract", "caterpillar", or "cat". Without the parentheses, + matches "cataract", "caterpillar", or "cat". Without the parentheses, it would match "cataract", "erpillar" or an empty string. - 2. It sets up the subpattern as a capturing subpattern. This means - that, when the whole pattern matches, that portion of the subject + 2. It sets up the subpattern as a capturing subpattern. This means + that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the - ovector argument of the matching function. 
(This applies only to the - traditional matching functions; the DFA matching functions do not sup- + ovector argument of the matching function. (This applies only to the + traditional matching functions; the DFA matching functions do not sup- port capturing.) Opening parentheses are counted from left to right (starting from 1) to - obtain numbers for the capturing subpatterns. For example, if the + obtain numbers for the capturing subpatterns. For example, if the string "the red king" is matched against the pattern the ((red|white) (king|queen)) @@ -6065,12 +6062,12 @@ SUBPATTERNS the captured substrings are "red king", "red", and "king", and are num- bered 1, 2, and 3, respectively. - The fact that plain parentheses fulfil two functions is not always - helpful. There are often times when a grouping subpattern is required - without a capturing requirement. If an opening parenthesis is followed - by a question mark and a colon, the subpattern does not do any captur- - ing, and is not counted when computing the number of any subsequent - capturing subpatterns. For example, if the string "the white queen" is + The fact that plain parentheses fulfil two functions is not always + helpful. There are often times when a grouping subpattern is required + without a capturing requirement. If an opening parenthesis is followed + by a question mark and a colon, the subpattern does not do any captur- + ing, and is not counted when computing the number of any subsequent + capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) @@ -6078,37 +6075,37 @@ SUBPATTERNS the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of capturing subpatterns is 65535. 
- As a convenient shorthand, if any option settings are required at the - start of a non-capturing subpattern, the option letters may appear + As a convenient shorthand, if any option settings are required at the + start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns (?i:saturday|sunday) (?:(?i)saturday|sunday) match exactly the same set of strings. Because alternative branches are - tried from left to right, and options are not reset until the end of - the subpattern is reached, an option setting in one branch does affect - subsequent branches, so the above patterns match "SUNDAY" as well as + tried from left to right, and options are not reset until the end of + the subpattern is reached, an option setting in one branch does affect + subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday". DUPLICATE SUBPATTERN NUMBERS Perl 5.10 introduced a feature whereby each alternative in a subpattern - uses the same numbers for its capturing parentheses. Such a subpattern - starts with (?| and is itself a non-capturing subpattern. For example, + uses the same numbers for its capturing parentheses. Such a subpattern + starts with (?| and is itself a non-capturing subpattern. For example, consider this pattern: (?|(Sat)ur|(Sun))day - Because the two alternatives are inside a (?| group, both sets of cap- - turing parentheses are numbered one. Thus, when the pattern matches, - you can look at captured substring number one, whichever alternative - matched. This construct is useful when you want to capture part, but + Because the two alternatives are inside a (?| group, both sets of cap- + turing parentheses are numbered one. Thus, when the pattern matches, + you can look at captured substring number one, whichever alternative + matched. This construct is useful when you want to capture part, but not all, of one of a number of alternatives. 
Inside a (?| group, paren- - theses are numbered as usual, but the number is reset at the start of - each branch. The numbers of any capturing parentheses that follow the - subpattern start after the highest number used in any branch. The fol- + theses are numbered as usual, but the number is reset at the start of + each branch. The numbers of any capturing parentheses that follow the + subpattern start after the highest number used in any branch. The fol- lowing example is taken from the Perl documentation. The numbers under- neath show in which buffer the captured content will be stored. @@ -6116,58 +6113,58 @@ DUPLICATE SUBPATTERN NUMBERS / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x # 1 2 2 3 2 3 4 - A back reference to a numbered subpattern uses the most recent value - that is set for that number by any subpattern. The following pattern + A back reference to a numbered subpattern uses the most recent value
[... 2203 lines stripped ...]