On Tue, Jul 4, 2023 at 11:50 AM Thomas Schwinge <tho...@codesourcery.com> wrote: > > Hi! > > I came across this one here on my way working through another (somewhat > related) GTY issue. I generally do understand the issue here, but do > have a question about 'unsigned int len' field in > 'libcpp/include/symtab.h:struct ht_identifier': > > On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > When a GTY'ed struct is streamed to PCH, any plain char* pointers it > > contains > > (whether they live in GC-controlled memory or not) will be marked for PCH > > output by the routine gt_pch_note_object in ggc-common.cc. This routine > > special-cases plain char* strings, and in particular it uses strlen() to get > > their length. > > Oh, wow, this special casing for strings... 8-| > > > Thus it does not handle strings with embedded null bytes, but it > > is possible for something PCH cares about (such as a string literal token > > in a > > macro definition) to contain such embedded nulls. To fix that up, add a new > > GTY option "string_length" so that gt_pch_note_object can be informed the > > actual length it ought to use, and use it in the relevant libcpp structs > > (cpp_string and ht_identifier) accordingly. > > For your test case: > > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C > > @@ -0,0 +1,3 @@ > > +// { dg-do compile { target c++11 } } > > +#include "pch-string-nulls.H" > > +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error"); > > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs > > @@ -0,0 +1,2 @@ > > +/* Note that there is a null byte following "ABC". */ > > +#define X R"(ABC^@[!])" > > ..., I understand how the following is necessary: > > > --- a/libcpp/include/cpplib.h > > +++ b/libcpp/include/cpplib.h > > @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, > > CLK_GNUC17, CLK_GNUC2X, > > /* Payload of a NUMBER, STRING, CHAR or COMMENT token. */ > > struct GTY(()) cpp_string { > > unsigned int len; > > - const unsigned char *text; > > + > > + /* TEXT is always null terminated (terminator not included in len); but > > this > > + GTY markup arranges that PCH streaming works properly even if there > > is a > > + null byte in the middle of the string. */ > > + const unsigned char * GTY((string_length ("1 + %h.len"))) text; > > }; > > (That is, the test case FAILs with that one reverted.) > > However, this one did confuse me: > > > --- a/libcpp/include/symtab.h > > +++ b/libcpp/include/symtab.h > > @@ -29,7 +29,10 @@ along with this program; see the file COPYING3. If not > > see > > typedef struct ht_identifier ht_identifier; > > typedef struct ht_identifier *ht_identifier_ptr; > > struct GTY(()) ht_identifier { > > - const unsigned char *str; > > + /* This GTY markup arranges that the null-terminated identifier would > > still > > + stream to PCH correctly, if a null byte were to make its way into an > > + identifier somehow. */ > > + const unsigned char * GTY((string_length ("1 + %h.len"))) str; > > unsigned int len; > > unsigned int hash_value; > > }; > > I did wonder whether that's an actual or just a theoretical concern, to > have 'ht_identifier's with embedded NULs? If an actual concern, can we > get a test case constructed? Otherwise, should we revert this hunk, > given that we have this auto-'strlen' handling, ignorant of embedded > NULs, in a lot of other places? > > But then I realized that possibly we do maintain 'len' here not for > correctness but as an optimization (trading an 'unsigned int' for > repeated 'strlen' calls)? My quick testing with the attached > "[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm > this theory: no regressions (..., but I didn't bootstrap, and ran only > parts of the testsuite). (Not proposing that RFC for 'git push', of > course.) > > If that's indeed the intention here, I shall change/add source code > commentary to describe this rationale for this dedicated 'len' field > (plus, that handling of embedded NULs falls out of that, automatically). > > For reference, this 'len' field has existed "forever". Before > 'struct ht_identifier' was added in > commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we > had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or > earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode': > 'unsigned short length', earlier 'size_t length' with comment: > "length of token, for quick comparison", erlier 'int length', ever since > the 'gcc/cpp*' files were added in > commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191). >
I don't think there is currently any possibility for a null byte to end up in an ht_identifier's string. I assumed that ht_identifier stores the length as an optimization (especially since it doesn't take up any extra space on 64-bit platforms, given the 32-bit hash code is stored as well there.) I created the string_length GTY markup mainly to support another patch that I have still pending review, which I thought would increase the likelihood of PCH needing to handle null bytes in general. When I did that, I added the markup to ht_identifier simply because the length was already there, so there was no reason not to add it. It does save a few cycles when streaming out the PCH, but I doubt it is meaningful. -Lewis