ID:               47480
User updated by:  sehh at ionos dot gr
Reported By:      sehh at ionos dot gr
Status:           Open
Bug Type:         PCRE related
Operating System: Linux
PHP Version:      5.2.8
New Comment:

Do you think it would be better if I contacted the developers of the PCRE
library at http://www.pcre.org/ ? Submitting a patch or bug report to them
would cover many more open source projects, instead of patching only the
PCRE library used by PHP.


Previous Comments:
------------------------------------------------------------------------

[2009-03-09 17:20:56] mmcnickle at gmail dot com

It wouldn't be impossible, no. But to someone without detailed knowledge
of Greek it would be.

The unicode.org article on regular expressions [1] has this to say:

"All of the above deals with a default specification for a regular
expression. However, a regular expression engine also may want to
support tailored specifications, typically tailored for a particular
language or locale. This may be important when the regular expression
engine is being used by end-users instead of programmers, such as in a
word-processor allowing some level of regular expressions in
searching."

Earlier in the document it says that basic regex engines are only
required to include the basic Unicode uppercase/lowercase matching.

Looking through the source code of the PCRE library, it does seem
possible to generate locale-specific character tables; this may be an
avenue to look into.

Perhaps the best thing to do would be to drop a message in the
internationalization mailing list (http://marc.info/?l=php-i18n) and see
what they have to say.

[1] http://unicode.org/reports/tr18/#Tailored_Support

------------------------------------------------------------------------

[2009-03-09 16:01:59] sehh at ionos dot gr

Indeed that's far from ideal. From my development point of view it's
impossible to rewrite every single accented character with its possible
equivalents, for every string in every regex.

For example, this:

/Βαλβίδες εισαγωγής-εξαγωγής/i

Would become a monster like this:

/Βαλβ[ΙίΊ]δ[ΕεΈ]ς εισαγωγ[ΗήΉ]ς-εξαγωγ[ΗήΉ]ς/i

We would need a regex to create the regex! Or at least a text
search/replace method in PHP.

Are you sure it's impossible to add a few exceptions within the PCRE
library?

------------------------------------------------------------------------

[2009-03-09 15:25:51] mmcnickle at gmail dot com

Yes, unfortunately trying to include locale- and language-specific cases
is next to impossible for regular expression engine developers.

The best that can be done, though far from ideal, is for the user to
take these variations into account when crafting the regex:

$target1 = "ΚΙΝΗΤ[Ηή]ΡΑ";        // Greek
$target1 = "Stra(?:ss|ß)enbahn"; // German

------------------------------------------------------------------------

[2009-03-09 15:00:25] sehh at ionos dot gr

I forgot the capital accented characters, so the above should read:

"Η" == "ή" == "η" == "Ή"
"Α" == "ά" == "α" == "Ά"
etc.

Remember that in Greek, the accent may be omitted from capital letters
or may be included for the first letter only. So that should produce
proper case-insensitive results.

------------------------------------------------------------------------

[2009-03-09 14:54:32] sehh at ionos dot gr

The PCRE library is wrong then.

Unicode correctly defines "η" as the lowercase of "Η", but the library
should also understand that "Η" == "ή" == "η".

This applies to all Greek accents: "Α" == "ά" == "α" etc.

Otherwise, the "/i" modifier is useless for the Greek language, and
that's why the current implementation does not work for Greek.

Thank you for taking the time to look into this issue, much
appreciated.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/47480

-- 
Edit this bug report at http://bugs.php.net/?id=47480&edit=1
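
The 16:01:59 comment above asks for "a text search/replace method in PHP" as an alternative to hand-writing a character class for every vowel. A minimal sketch of that idea follows. It is a userland workaround, not a change to PCRE: both the search text and the subject are lowercased and stripped of accents before matching. The function names greek_fold() and greek_contains() are hypothetical, and the sketch assumes UTF-8 input, the mbstring extension, and precomposed accented letters (combining accent marks are not handled).

```php
<?php
// Hypothetical helper: lowercase a UTF-8 string and remove Greek accents
// (tonos and dialytika), also mapping final sigma to plain sigma.
function greek_fold(string $text): string
{
    $lower = mb_strtolower($text, 'UTF-8');

    // Accented lowercase letter => unaccented base letter.
    $map = [
        'ά' => 'α', 'έ' => 'ε', 'ή' => 'η', 'ί' => 'ι',
        'ό' => 'ο', 'ύ' => 'υ', 'ώ' => 'ω',
        'ϊ' => 'ι', 'ϋ' => 'υ', 'ΐ' => 'ι', 'ΰ' => 'υ',
        'ς' => 'σ',
    ];

    return strtr($lower, $map);
}

// Hypothetical helper: accent- and case-insensitive search for a literal
// string. Only literal text is folded, so regex metacharacters in the
// needle are escaped with preg_quote() rather than interpreted.
function greek_contains(string $haystack, string $needle): bool
{
    $pattern = '/' . preg_quote(greek_fold($needle), '/') . '/u';

    return preg_match($pattern, greek_fold($haystack)) === 1;
}

var_dump(greek_contains('Ο κινητήρας', 'ΚΙΝΗΤΗΡΑ')); // bool(true)
```

Folding both sides keeps the pattern itself short, at the cost of matching against a modified copy of the subject: any offsets returned by a preg_* call refer to the folded string, which happens to have the same length here because every mapping replaces one two-byte character with another.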