Re: Build from git broken - missing gperf?

Gavin Smith Wed, 07 Feb 2024 14:51:53 -0800

On Tue, Feb 06, 2024 at 07:13:09PM +0100, Patrice Dumas wrote:
> On Mon, Feb 05, 2024 at 07:35:59PM +0000, Gavin Smith wrote:
> > I don't know if uniconv/u8-conv-from-enc is a necessary module.  It's
> > not easy to find out how the module is used as the documentation is
> > lacking, but it appears to match libunistring.  The documentation is
> > here:
> > https://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html
> > 
> > I found uses of "u8_strconv_from_encoding" throughout the XS code,
> > although most of the uses (I didn't check them all) have "UTF-8" as one
> > of the arguments, making it appear that we are converting from UTF-8
> > to UTF-8.
> 
> It is the case.  We actually already discussed that issue previously.  In
> the code I wrote, and in order to follow what I understood from the
> libunistring documentation, char * is converted to uint8_t * by calling
> u8_strconv_from_encoding even though the string is already UTF-8.  In
> your code in xspara.c you simply cast to uint8_t *.  It could also be done
> like that elsewhere in the code, I do not know what is best.

The immediate solution is to require gperf as a tool for developers, just
like automake, autoconf, etc.

Getting away from u8_strconv_from_encoding could take some more effort
and isn't immediately necessary, but would be nice to do to reduce bloat.
Since we only use it for UTF-8 validation, we could do this in some other
function that is simpler and doesn't pull in as much from gnulib.

I saw your private email from November 2023. Here's part of what
I wrote in my response (for the benefit of the mailing list):

We can assume the text strings coming out of Perl are encoded already
in UTF-8, so running a conversion on them is pointless and confusing.

According to the libunistring manual:

    The five types char *, uint8_t *, uint16_t *, uint32_t *, and wchar_t
    * are incompatible types at the C level. Therefore, ‘gcc -Wall’
    will produce a warning if, by mistake, your code contains a mismatch
    between these types. In the context of using GNU libunistring, even
    a warning about a mismatch between char * and uint8_t * is a sign of
    a bug in your code that you should not try to silence through a cast.

https://www.gnu.org/software/libunistring/manual/libunistring.html#In_002dmemory-representation

However, I don't understand how this can possibly be avoided, other than
by running pointless conversions. SvPV, which we use in XSParagraph.xs
to get the pointer, returns a char * value. Unless the Perl API can
give a value with a type of uint8_t * to represent a UTF-8 string,
then we can only avoid such warnings with a cast.

I can see the appeal of not fully trusting Perl's API to provide correct
values for use in our own XS code. I suggest that if we do use a cast
we can do it in one single place in the code along with any validation
we do on the UTF-8. We could start with a wrapper around
u8_strconv_from_encoding. I'm happy to work on this myself when I have
time.
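
As a rough sketch of the "cast in one single place" idea: the code below
keeps the char * to uint8_t * cast inside one wrapper and validates the
bytes before handing them out. The names utf8_is_valid and utf8_from_char
are hypothetical, and the validator is hand-written for illustration; it
is not taken from libunistring or gnulib, and it does not call
u8_strconv_from_encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return 1 if the LEN bytes at S are well-formed
   UTF-8, 0 otherwise.  Rejects truncated sequences, overlong forms,
   surrogates and code points above U+10FFFF.  */
static int
utf8_is_valid (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];
      size_t n;      /* number of continuation bytes expected */
      uint32_t cp;   /* code point being decoded */
      uint32_t min;  /* smallest code point allowed for this length */
      size_t k;

      if (c < 0x80)
        {
          i++;
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
      else
        return 0;    /* stray continuation byte or invalid lead byte */

      if (len - i <= n)
        return 0;    /* sequence is cut off by the end of the string */

      for (k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return 0;
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

      i += n + 1;
    }
  return 1;
}

/* Hypothetical wrapper: the only place where char * is cast to
   uint8_t *.  Returns NULL if S is NULL or is not valid UTF-8.  */
static const uint8_t *
utf8_from_char (const char *s)
{
  if (!s || !utf8_is_valid ((const unsigned char *) s, strlen (s)))
    return NULL;
  return (const uint8_t *) s;
}
```

In XS code the argument would presumably be the char * obtained from
SvPV; callers would then check for NULL rather than casting themselves.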

> That being said, we also directly use gnulib iconv, so I think that
> iconv_open would still be brought in anyway.

We'd have to see if this module was still worth using for the platforms
it supports and the problems it solves.
