From:             kobrien at kiva dot org
Operating system: Ubuntu 12.04
PHP version:      5.5.0alpha1
Package:          *Unicode Issues
Bug Type:         Feature/Change Request
Bug description:  Create a mb_str_word_count() function which is multi-byte aware

Description:
------------
Create a mb_str_word_count() function which properly counts the number of words in a string that contains multi-byte characters. This is currently not possible with str_word_count() because it relies on the C function isalpha(), which does not properly handle multi-byte characters.

As suggested by aharvey, this new function would replace usage of isalpha() with iswalpha(). A naive patch (untested, and written without real knowledge of this code) would look like:

diff --git a/ext/standard/string.c b/ext/standard/string.c
index 7a4ae2e..9ab6b5f 100644
--- a/ext/standard/string.c
+++ b/ext/standard/string.c
@@ -5202,7 +5202,7 @@ PHP_FUNCTION(str_word_count)
 	while (p < e) {
 		s = p;
-		while (p < e && (isalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
+		while (p < e && (iswalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
 			p++;
 		}
 		if (p > s) {

Test script:
---------------
<?php
// existing str_word_count function for comparison
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n"; // returns 11
print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет."); // returns 0

// new function mb_str_word_count
print mb_str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет."); // returns 37

Expected result:
----------------
Using mb_str_word_count() will return the number of words in a string containing multi-byte characters.

Actual result:
--------------
Currently there is no mb_str_word_count() function. Using str_word_count() on a string with multi-byte characters returns 0.

-- 
Edit bug report at https://bugs.php.net/bug.php?id=63671&edit=1
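
Until such a function exists, a rough userland approximation is possible with PCRE Unicode properties. The following is only a sketch under stated assumptions, not the proposed C change: the name mb_str_word_count_sketch() is hypothetical, the input is assumed to be valid UTF-8, and a word is taken to be a letter followed by any run of letters, apostrophes, or hyphens, which may differ from str_word_count() in edge cases.

```php
<?php
// Hypothetical userland helper; the name and the word definition are
// assumptions, not part of PHP or of the patch proposed in this report.
function mb_str_word_count_sketch($text)
{
    // \p{L} matches any Unicode letter; the /u modifier makes PCRE treat
    // the pattern and the subject as UTF-8. A word must start with a letter
    // and may continue with letters, apostrophes, or hyphens.
    $count = preg_match_all("/\\p{L}[\\p{L}'-]*/u", $text);

    // preg_match_all() returns false on failure (for example invalid UTF-8).
    return $count === false ? 0 : $count;
}

print mb_str_word_count_sketch("PHP function str_word_count does not properly handle non-latin characters") . "\n"; // prints 11
print mb_str_word_count_sketch("Он более двадцати лет работает по профессии.") . "\n"; // prints 7
```

Digits are not letters, so tokens such as "70" are skipped, in line with how str_word_count() ignores them by default.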