On Thu, Sep 19, 2024 at 08:17:24AM +0200, Richard Biener wrote:
> On Wed, Sep 18, 2024 at 7:33 PM Jakub Jelinek <ja...@redhat.com> wrote:
> >
> > On Wed, Sep 18, 2024 at 06:17:58PM +0100, Richard Sandiford wrote:
> > > +1  I'd much rather learn about this kind of error before the code reaches
> > > a review tool :)
> > >
> > > From a quick check, it doesn't look like Clang has this, so there is no
> > > existing name to follow.
> >
> > I was considering also -Wtrailing-whitespace, but
> > 1) git diff really warns just about trailing spaces/tabs, not form feeds or
> > vertical tabs
> > 2) gcc source contains tons of spots with form feed in it (though,
> > I think pretty much always as the sole character on a line).
> > And not really sure how people use vertical tabs in the source if at all.
> > Perhaps form feed could be not warned if at end of line if it isn't the sole
> > character on a line...
>
> Generally I like diagnosing this early.  For the above I'd say
> -Wtrailing-whitespace=
> with a set of things to diagnose (and a sane default - just spaces and
> tabs - for -Wtrailing-whitespace) would be nice.  As for naming possibly
> follow the is{space,blank,cntrl} character classifications?  If those are a
> good fit, that is.

I think the character classifications risk problems.
space is ' ' '\t' '\n' '\r' '\f' '\v' in the C locale,
blank is ' ' '\t'
cntrl is a lot of chars but not ' '
if we extend by the safe-ctype
vspace '\r' '\n'
nvspace ' ' '\t' '\f' '\v' '\0'

Obviously, we shouldn't look at '\r' and '\n', those aren't trailing
characters, those are line separators.

Would we need to consider all UTF-8 (or EBCDIC-UTF) control characters
as cntrl?
0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>
000B..000C    ; Control # Cc   [2] <control-000B>..<control-000C>
000E..001F    ; Control # Cc  [18] <control-000E>..<control-001F>
007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>
00AD          ; Control # Cf       SOFT HYPHEN
061C          ; Control # Cf       ARABIC LETTER MARK
180E          ; Control # Cf       MONGOLIAN VOWEL SEPARATOR
200B          ; Control # Cf       ZERO WIDTH SPACE
200E..200F    ; Control # Cf   [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028          ; Control # Zl       LINE SEPARATOR
2029          ; Control # Zp       PARAGRAPH SEPARATOR
202A..202E    ; Control # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Control # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Control # Cn       <reserved-2065>
2066..206F    ; Control # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
FEFF          ; Control # Cf       ZERO WIDTH NO-BREAK SPACE
FFF0..FFF8    ; Control # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
FFF9..FFFB    ; Control # Cf   [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
13430..1343F  ; Control # Cf  [16] EGYPTIAN HIEROGLYPH VERTICAL JOINER..EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
1BCA0..1BCA3  ; Control # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Control # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Control # Cn       <reserved-E0000>
E0001         ; Control # Cf       LANGUAGE TAG
E0002..E001F  ; Control # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0080..E00FF  ; Control # Cn [128] <reserved-E0080>..<reserved-E00FF>
E01F0..E0FFF  ; Control # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

I wonder why anybody would be interested in finding just trailing spaces
and not trailing tabs, or vice versa.  So if we have categories, blank
would be one, then perhaps something like nvspace but not including '\0'
(so just ' ' '\t' '\f' '\v'), and, if really needed, control characters
with ' ' added.  But what to call that, and would it really need to parse
UTF-8/EBCDIC and look at pregenerated tables?

	Jakub
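
To make the two ASCII-only categories above concrete, here is a minimal
sketch in C.  It is not taken from any GCC patch: the enum, its value names
and both functions are hypothetical, and it assumes the caller has already
stripped the line terminator ('\n' or "\r\n"), since those are line
separators rather than trailing characters.  It deliberately ignores the
Unicode control characters, so no UTF-8 parsing or pregenerated tables are
involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical category names for illustration only; these are not
   actual GCC option values.  */
enum trailing_ws_kind
{
  TW_BLANK,	/* ' ' and '\t' */
  TW_SPACE	/* ' ', '\t', '\f' and '\v' (nvspace without '\0') */
};

/* Return nonzero if C belongs to category KIND.  */
static int
in_category (unsigned char c, enum trailing_ws_kind kind)
{
  if (c == ' ' || c == '\t')
    return 1;
  return kind == TW_SPACE && (c == '\f' || c == '\v');
}

/* Return the number of bytes at the end of LINE (LEN bytes long, line
   terminator already removed) that belong to category KIND.  */
size_t
trailing_ws_len (const char *line, size_t len, enum trailing_ws_kind kind)
{
  assert (line != NULL || len == 0);

  size_t n = 0;
  while (n < len && in_category ((unsigned char) line[len - 1 - n], kind))
    n++;
  return n;
}
```

With this sketch, a line ending in a form feed counts as clean under
TW_BLANK and is flagged only under TW_SPACE.  A line that consists solely
of a form feed is also flagged under TW_SPACE, so the special case for
such lines discussed in the thread would need an extra check on top.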