On Mon, Oct 23, 2023 at 04:41:17PM +0300, Eli Zaretskii wrote: > Bingo. This brings the time for producing the ELisp manual down to > 15.4 sec, 5 sec faster than v7.0.3. > > I see that btowc linked into the XSParagraph module is a MinGW > specific implementation, not from the Windows-standard MSVCRT (where > it is absent). My conclusion is that the MinGW btowc is extremely > inefficient.
Great. Hopefully it helps you to be more productive on working on documentation. I propose the following, more finished patch, which applies to Texinfo 7.1. We can also do something similar for the master branch. I am making a release/7.1 branch for this fix, so it will be included if there is ever a bug fix release 7.1.1. However, I am not going to make a bug fix release with just this change in it. https://git.savannah.gnu.org/cgit/texinfo.git/commit/?h=release/7.1&id=c76bcd0feed005aaf9db28a76f4883f3ae98295b If we do move away from locale-based character processing in the paragraph formatter, then we will not be using mbrtowc or btowc in the future. diff --git a/ChangeLog b/ChangeLog index e619109f5b..c4379ec56b 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,10 @@ +2023-10-23 Gavin Smith <gavinsmith0...@gmail.com> + + * tp/Texinfo/XS/xspara.c (get_utf8_codepoint): + Wrapper for mbrtowc/btowc. + [_WIN32]: Do not call btowc, as it was tested to be very slow + on MinGW. Report from Eli Zaretskii. + 2023-10-18 Gavin Smith <gavinsmith0...@gmail.com> Texinfo 7.1 diff --git a/tp/Texinfo/XS/xspara.c b/tp/Texinfo/XS/xspara.c index 7c6895a7ff..e1cddcdc2a 100644 --- a/tp/Texinfo/XS/xspara.c +++ b/tp/Texinfo/XS/xspara.c @@ -684,6 +684,30 @@ xspara_end (void) /* characters triggering an end of sentence */ #define end_sentence_characters ".?!" +/* Wrapper for mbrtowc. Set *PWC and return length of codepoint in bytes. */ +size_t +get_utf8_codepoint (wchar_t *pwc, const char *mbs, size_t n) +{ +#ifdef _WIN32 + /* Use the above implementation of mbrtowc. Do not use btowc as + does not exist as standard on MS-Windows, and was tested to be + very slow on MinGW. */ + return mbrtowc (pwc, mbs, n, NULL); +#else + if (!PRINTABLE_ASCII(*mbs)) + { + return mbrtowc (pwc, mbs, n, NULL); + } + else + { + /* Functionally the same as mbrtowc but (tested) slightly quicker. */ + *pwc = btowc (*mbs); + return 1; + } +#endif +} + + /* Add WORD to paragraph in RESULT, not refilling WORD. If we go past the end of the line start a new one. TRANSPARENT means that the letters in WORD are ignored for the purpose of deciding whether a full stop ends a sentence @@ -730,18 +754,7 @@ xspara__add_next (TEXT *result, char *word, int word_len, int transparent) if (!strchr (end_sentence_characters after_punctuation_characters, *p)) { - if (!PRINTABLE_ASCII(*p)) - { - wchar_t wc = L'\0'; - mbrtowc (&wc, p, len, NULL); - state.last_letter = wc; - break; - } - else - { - state.last_letter = btowc (*p); - break; - } + get_utf8_codepoint (&state.last_letter, p, len); } } } @@ -1013,16 +1026,7 @@ xspara_add_text (char *text, int len) } /************** Not a white space character. *****************/ - if (!PRINTABLE_ASCII(*p)) - { - char_len = mbrtowc (&wc, p, len, NULL); - } - else - { - /* Functonally the same as mbrtowc but (tested) slightly quicker. */ - char_len = 1; - wc = btowc (*p); - } + char_len = get_utf8_codepoint (&wc, p, len); if ((long) char_len == 0) break; /* Null character. Shouldn't happen. */