On 08/21/18 10:33, Richard Biener wrote: > On Mon, 20 Aug 2018, Bernd Edlinger wrote: > >> On 08/20/18 15:19, Richard Biener wrote: >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >>> >>>> On 08/20/18 13:01, Richard Biener wrote: >>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlin...@hotmail.de> >>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 08/01/18 11:29, Richard Biener wrote: >>>>>>> >>>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>>>> for your check above. Because the '\0' doesn't belong to the >>>>>>> string. Then build_string internally appends a '\0' outside >>>>>>> of TREE_STRING_LENGTH. >>>>>>> >>>>>> >>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>>>> character. >>>>> >>>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>>>> parameter to build_string and allocate as many extra 0s as needed. >>>>> >>>>> There are STRING_CSTs which are not string literals, >>>>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>>>> >>>>>>> >>>>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>>>> are internally always generated properly by the front end. >>>>>>> >>>>>>> Yeah, I guess we need to define what "properly" is ;) >>>>>>> >>>>>> Yes. >>>>>> >>>>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>>>> this is always exactly one (wide) character size less than >>>>>>>> TREE_STRING_LENGTH >>>>>>> >>>>>>> I think it should be always the same... >>>>>>> >>>>>> >>>>>> One could not differentiate between "\0" without zero-termination >>>>>> and "" with zero-termination, theoretically. >>>>> >>>>> Is that important? Doesn't the C standard say how to parse string >>>>> literals? >>>>> >>>>>> We also have char x[100] = "ab"; >>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>>>> but imagine char x[100000000000] = "ab" >>>>> >>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it >>>>> unconditionally be [], thus incomplete (because the literals "size" >>>>> depends >>>>> on the context/LHS it is used on). >>>>> >>>> >>>> Sorry, but I must say, it is not at all like that. >>>> >>>> If I compile x.c: >>>> const char x[100] = "ab"; >>>> >>>> and set a breakpoint at output_constant: >>>> >>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >>>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >>>> 4821 if (size == 0 || flag_syntax_only) >>>> (gdb) p size >>>> $1 = 100 >>>> (gdb) call debug(exp) >>>> "ab" >>>> (gdb) p *exp >>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, >>>> (gdb) p exp->typed.type->type_common.size_unit >>>> $5 = (tree) 0x7ffff6ff9d80 >>>> (gdb) call debug(exp->typed.type->type_common.size_unit) >>>> 100 >>>> (gdb) p exp->string.length >>>> $6 = 3 >>>> (gdb) p exp->string.str[0] >>>> $8 = 97 'a' >>>> (gdb) p exp->string.str[1] >>>> $9 = 98 'b' >>>> (gdb) p exp->string.str[2] >>>> $10 = 0 '\000' >>>> (gdb) p exp->string.str[3] >>>> $11 = 0 '\000' >>>> >>>> >>>> This is an important property of string_cst objects, that is used in >>>> c_strlen: >>>> >>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond >>>> TREE_STRING_LENGTH >>>> is guaranteed to be zero up to the type size. >>>> >>>> I would not have spent one thought on implementing an optimization like >>>> that, >>>> but that's how it is right now. >>> >>> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka >>> they have zero-padding up to its type size. I don't see this documented >>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" >>> with appropriate TYPE_SIZE. >>> >>> This is also a relatively new thing on trunk (coming out of the added >>> mem_size parameter of string_constant). That this looks at the STRING_CST >>> type like >>> >>> if (TREE_CODE (array) == STRING_CST) >>> { >>> *ptr_offset = fold_convert (sizetype, offset); >>> if (mem_size) >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>> return array; >>> >>> I'd call simply a bug. As said, interpretation of STRING_CSTs should >>> depend on the context. And for >>> >> >> This use of the TYPE_SIZE_UNIT was pre-existing to r263607 >> before that (but not in the gcc-8-branch) we had this in c_strlen: >> >> HOST_WIDE_INT maxelts = strelts; >> tree type = TREE_TYPE (src); >> if (tree size = TYPE_SIZE_UNIT (type)) >> if (tree_fits_shwi_p (size)) >> { >> maxelts = tree_to_uhwi (size); >> maxelts = maxelts / eltsize - 1; >> } >> >> which I moved to string_constant and return the result through memsize. >> >> It seems to be working somehow, and I'd bet removing that will immediately >> break twenty or thirty of the strlenopt tests. >> >> Obviously the tree string objects allow way too much variations, >> and it has to be restricted in some way or another. >> >> In the moment my approach is: look at what most string constants do >> and add assertions to ensure that there are no exceptions. >> >> >>> char a[4] = "abc"; >>> char b[5] = "abc"; >>> >>> we should better be able to share STRING_CSTs [you can see LTO >>> sharing the nodes if you do b[4] but not when b[5] I suppose]. >>> >>>> All I want to do, is make sure that all string constants have the same >>>> look and feel >>>> in the middle-end, and restrict the variations that are allowed by the >>>> current >>>> implementation. >>> >>> Sure, I understand that. But I'd like to simplify things and not add >>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide >>> whether sth is 0-terminated. >>> >> >> This is not only about 0-terminated. (*) >> >> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, >> there are places in the middle-end that assume that the object contains >> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. >> Those I want to protect. > > Well, but string_constant handles &STRING_CST just fine but in that > context there's no "object" we look at. > > IMHO whenever string_constant ends up with a DECL, looking at > ctor_for_folding and we end up with a STRING_CST that doesn't fit > in the DECL we looked at we have a bug (non-truncated STRING_CST) > and either should do the truncation or simply return NULL. > > So we should look at DECL_SIZE_UNIT and compare that with > TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT > of the STRING_CST being "correct" (fitting the decl context). > >> Bernd. >> >> >> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, >> the string is only returned when it is properly zero terminated.", >> but maybe I should have written: >> "If MEM_SIZE is zero, the string is only returned when the storage size >> of the string object is at least TREE_STRING_LENGTH." >> That's at least exactly what the check does. > > Well, as said above > > if (TREE_CODE (array) == STRING_CST) > { > *ptr_offset = fold_convert (sizetype, offset); > if (mem_size) > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > return array; > } > > simply assumes the "storage size" of a STRING_CST is determined > by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, > but clearly in the above case there's nothing to protect? And in > the case we pulled the STRING_CST from some DECL_INITIAL it doesn't > protect from overflow of the FIELD_DECL unless frontends never > generate "wrong" STRING_CSTs? >
Hmm, I digged some more in varasm.c Normal STRING_CST use get_constant_size to allocate MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH) thus they can have excess zero bytes. while STRING_CST that are in DECL_INITIALIZERs are shrinked to DECL_SIZE_UNIT. At least that difference looks unnecessary to me. But compare_constant does not compare the TYPE_SIZE_UNIT: case STRING_CST: if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) return 0; return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2), TREE_STRING_LENGTH (t1))); This looks now like a bug. > Btw, get_constant_size / mergeable_string_section suggsts that > we may view STRING_CST as general target-encoded byte blob. > That may be useful to compress CONSTRUCTORs memory use. > > It looks like mergeable_string_section wrongly would reject > a char[] typed string because int_size_in_bytes returns -1 > for incomplete types. I still believe using char[] for most > STRING_CSTs generated by FEs would be a good thing for > IL memory use. Shouldn't the mem_size initializations use > sth like get_constant_size as well? > I think incomplete types exist only in structs with variable length: struct { int x; char y[]; } s = { 1, "test" }; this is no candidate for a mergeable string. but char y[] = "test"; is fixed by the FE to: char y[5] = "test"; So that does not make problems with mergeable_string_section. > Also > > if (mem_size) > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); > else if (compare_tree_int (array_size, length + 1) < 0) > return NULL_TREE; > > the latter check doesn't seem to honor 'offset' and *mem_size > is again looking at the STRING_CST, not at the FIELD_DECL or > wherever we got the STRING_CST from. > the caller compares offset with *mem_size, but we do not have the FIELD_DECL at hand here (Or I did not know where it was). I'd rather fail the cases when the TYPE_SIZE_UNIT is null, or something non-constant. All those are variations of vla with initializer and similar. Once the STRING_CST has a type, I would like to use it, when it doesn't the whole thing is probably not worth it anyway. Bernd.