On Tue, Feb 06, 2024 at 07:13:09PM +0100, Patrice Dumas wrote:
> On Mon, Feb 05, 2024 at 07:35:59PM +0000, Gavin Smith wrote:
> > I don't know if uniconv/u8-conv-from-enc is a necessary module. It's
> > not easy to find out how the module is used as the documentation is
> > lacking, but it appears to match libunistring. The documentation is
> > here:
> >
> > https://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html
> >
> > I found uses of "u8_strconv_from_encoding" throughout the XS code,
> > although most of the uses (I didn't check them all) have "UTF-8" as one
> > of the arguments, making it appear that we are converting from UTF-8
> > to UTF-8.
>
> It is the case. We actually already discussed that issue previously, in
> the codes I did, and in order to follow what I understood from the
> libunistring documentation, char * is converted to uint8_t by calling
> u8_strconv_from_encoding even though the string is already UTF-8. In
> your code in xspara.c you simply cast to uint8_t. It could also be done
> like that in other codes, I do not know what is best.

The immediate solution is to require gperf as a tool for developers, just
like automake, autoconf, etc.

Getting away from u8_strconv_from_encoding could take some more effort and
isn't immediately necessary, but would be nice to do to reduce bloat. Since
we only use it for UTF-8 validation, we could do this in some other function
that is simpler and doesn't pull in as much from gnulib.

I saw your private email from November 2023. Here's part of what I wrote in
my response (for the benefit of the mailing list):

We can assume the text strings coming out of Perl are encoded already in
UTF-8, so running a conversion on them is pointless and confusing.

According to the libunistring manual:

    The five types char *, uint8_t *, uint16_t *, uint32_t *, and wchar_t *
    are incompatible types at the C level. Therefore, ‘gcc -Wall’ will
    produce a warning if, by mistake, your code contains a mismatch between
    these types. In the context of using GNU libunistring, even a warning
    about a mismatch between char * and uint8_t * is a sign of a bug in
    your code that you should not try to silence through a cast.

https://www.gnu.org/software/libunistring/manual/libunistring.html#In_002dmemory-representation

However, I don't understand how this can possibly be avoided, other than by
running pointless conversions. SvPV, which we use in XSParagraph.xs to get
the pointer, returns a char * value. Unless the Perl API can give a value
with a type of uint8_t * to represent a UTF-8 string, then we can only avoid
such warnings with a cast.

I can see the appeal of not fully trusting Perl's API to provide correct
values for use in our own XS code. I suggest that if we do use a cast we can
do it in one single place in the code along with any validation we do on the
UTF-8. We could start with a wrapper around u8_strconv_from_encoding. I'm
happy to work on this myself when I have time to.

> That being said, we also directly use gnulib iconv, so I think that
> iconv_open would still be brought in anyway.

We'd have to see if this module was still worth using for the platforms it supports and the problems it solves.
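
To make the "cast in one single place, along with validation" idea
concrete, here is a rough sketch. It does not use
u8_strconv_from_encoding or any gnulib module. The names utf8_is_valid and
utf8_from_char are made up for illustration and do not exist in the
Texinfo sources or in libunistring. The validation is a plain hand-written
check, and I have not measured it against what the existing conversion
accepts, so treat it as a starting point only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not in Texinfo or libunistring).
   Return 1 if the N bytes at S are well-formed UTF-8, 0 otherwise. */
int
utf8_is_valid (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = s[i];
      size_t len;     /* number of bytes in this sequence */
      uint32_t cp;    /* code point being decoded */
      uint32_t min;   /* smallest code point allowed for this length */
      size_t k;

      if (c < 0x80)
        {
          i++;        /* plain ASCII byte */
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return 0;     /* stray continuation byte or invalid lead byte */

      if (len > n - i)
        return 0;     /* sequence is cut off at the end of the buffer */

      for (k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return 0; /* expected a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min)
        return 0;     /* overlong encoding */
      if (cp > 0x10FFFF)
        return 0;     /* beyond the Unicode range */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;     /* UTF-16 surrogate, not allowed in UTF-8 */

      i += len;
    }
  return 1;
}

/* Hypothetical wrapper: the one place where char * becomes uint8_t *.
   Returns S reinterpreted as uint8_t * if it is valid UTF-8, or NULL
   if S is NULL or malformed.  No copy is made. */
const uint8_t *
utf8_from_char (const char *s)
{
  if (!s || !utf8_is_valid ((const unsigned char *) s, strlen (s)))
    return NULL;
  return (const uint8_t *) s;
}
```

The XS code would call utf8_from_char on the pointer it gets from SvPV
and handle a NULL result in whatever way we decide is appropriate.
libunistring's u8_check (in unistr.h) does a similar well-formedness check,
so the hand-written loop could be swapped for it if we don't mind keeping
that dependency.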