On Thu, Sep 19, 2024 at 08:17:24AM +0200, Richard Biener wrote:
> On Wed, Sep 18, 2024 at 7:33 PM Jakub Jelinek <ja...@redhat.com> wrote:
> >
> > On Wed, Sep 18, 2024 at 06:17:58PM +0100, Richard Sandiford wrote:
> > > +1  I'd much rather learn about this kind of error before the code reaches
> > > a review tool :)
> > >
> > > > From a quick check, it doesn't look like Clang has this, so there is no
> > > existing name to follow.
> >
> > I was considering also -Wtrailing-whitespace, but
> > 1) git diff really warns just about trailing spaces/tabs, not form feeds or
> > vertical tabs
> > 2) gcc source contains tons of spots with form feed in it (though,
> > I think pretty much always as the sole character on a line).
> > And not really sure how people use vertical tabs in the source if at all.
> > Perhaps form feed could be not warned if at end of line if it isn't the sole
> > character on a line...
> 
> Generally I like diagnosing this early.  For the above I'd say
> -Wtrailing-whitespace=
> with a set of things to diagnose (and a sane default - just spaces and
> tabs - for -Wtrailing-whitespace) would be nice.  As for naming, possibly
> follow the
> is{space,blank,cntrl} character classifications?  If those are a good
> fit, that is.

I think the character classifications risk problems.

space is ' ' '\t' '\n' '\r' '\f' '\v' in the C locale,
blank is ' ' '\t'
cntrl is a lot of chars but not ' '
and, if we extend the set with the safe-ctype classes,
vspace '\r' '\n'
nvspace ' ' '\t' '\f' '\v' '\0'
Obviously, we shouldn't look at '\r' and '\n', those aren't trailing
characters, those are line separators.
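
To make the overlap between these classes concrete, here is a minimal
sketch in C.  It uses the standard <ctype.h> functions (which behave as
listed above only in the "C" locale, the default at program start) and two
hypothetical helpers that mirror the vspace/nvspace sets as I listed them;
the helper names and the CLS_* bitmask are made up for illustration and are
not the safe-ctype macros themselves.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical bitmask, one bit per class discussed above.  */
enum
{
  CLS_SPACE = 1,
  CLS_BLANK = 2,
  CLS_CNTRL = 4,
  CLS_VSPACE = 8,
  CLS_NVSPACE = 16
};

/* Hypothetical helper mirroring the vspace set: '\r' '\n'.  */
static bool
is_vspace_sketch (unsigned char c)
{
  return c == '\r' || c == '\n';
}

/* Hypothetical helper mirroring the nvspace set:
   ' ' '\t' '\f' '\v' '\0'.  */
static bool
is_nvspace_sketch (unsigned char c)
{
  return c == ' ' || c == '\t' || c == '\f' || c == '\v' || c == '\0';
}

/* Return the set of classes C belongs to.  Assumes the "C" locale.  */
static unsigned
classify (unsigned char c)
{
  unsigned mask = 0;
  if (isspace (c))
    mask |= CLS_SPACE;
  if (isblank (c))
    mask |= CLS_BLANK;
  if (iscntrl (c))
    mask |= CLS_CNTRL;
  if (is_vspace_sketch (c))
    mask |= CLS_VSPACE;
  if (is_nvspace_sketch (c))
    mask |= CLS_NVSPACE;
  return mask;
}
```

With this, classify ('\f') yields space, cntrl and nvspace but not blank,
while classify (' ') yields space, blank and nvspace but not cntrl, which
is why no single is* class matches "things one might not want at the end
of a line".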

Would we need to consider all UTF-8 (or UTF-EBCDIC) control characters as
cntrl?
0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>
000B..000C    ; Control # Cc   [2] <control-000B>..<control-000C>
000E..001F    ; Control # Cc  [18] <control-000E>..<control-001F>
007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>
00AD          ; Control # Cf       SOFT HYPHEN
061C          ; Control # Cf       ARABIC LETTER MARK
180E          ; Control # Cf       MONGOLIAN VOWEL SEPARATOR
200B          ; Control # Cf       ZERO WIDTH SPACE
200E..200F    ; Control # Cf   [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028          ; Control # Zl       LINE SEPARATOR
2029          ; Control # Zp       PARAGRAPH SEPARATOR
202A..202E    ; Control # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Control # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Control # Cn       <reserved-2065>
2066..206F    ; Control # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
FEFF          ; Control # Cf       ZERO WIDTH NO-BREAK SPACE
FFF0..FFF8    ; Control # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
FFF9..FFFB    ; Control # Cf   [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
13430..1343F  ; Control # Cf  [16] EGYPTIAN HIEROGLYPH VERTICAL JOINER..EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
1BCA0..1BCA3  ; Control # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Control # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Control # Cn       <reserved-E0000>
E0001         ; Control # Cf       LANGUAGE TAG
E0002..E001F  ; Control # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0080..E00FF  ; Control # Cn [128] <reserved-E0080>..<reserved-E00FF>
E01F0..E0FFF  ; Control # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

I wonder why anybody would want to find just trailing spaces and not
trailing tabs or vice versa.  So if we have categories, blank would be one,
then perhaps nvspace without '\0', so just ' ' '\t' '\f' '\v', and, if
really needed, control characters plus ' '.  But what would we call that,
and would it really need to parse UTF-8/EBCDIC and look at pregenerated
tables?
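
For the first two categories, the check could be sketched roughly as
follows.  This is only an illustration under my own assumptions, not a
patch: the enum, its enumerator names and the function names are
hypothetical, the line is assumed to be a plain byte buffer, and '\r' and
'\n' are stripped first since they are line separators rather than trailing
characters.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical categories: KIND_BLANK is ' ' '\t'; KIND_SPACE is
   nvspace without '\0', i.e. ' ' '\t' '\f' '\v'.  */
enum trailing_kind
{
  KIND_BLANK,
  KIND_SPACE
};

/* Return true if C belongs to category KIND.  */
static bool
in_category (char c, enum trailing_kind kind)
{
  if (c == ' ' || c == '\t')
    return true;
  return kind == KIND_SPACE && (c == '\f' || c == '\v');
}

/* Return the number of trailing characters of category KIND in the
   LEN bytes at LINE, not counting any '\r' or '\n' at the very end.  */
static size_t
trailing_count (const char *line, size_t len, enum trailing_kind kind)
{
  /* Drop the line separators; they are never diagnosed.  */
  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
    len--;

  /* Walk backwards over the run of characters in the category.  */
  size_t end = len;
  while (len > 0 && in_category (line[len - 1], kind))
    len--;
  return end - len;
}
```

A line consisting of "foo" followed by a form feed would then give 0 for
KIND_BLANK and 1 for KIND_SPACE.  A control-character category would need
more than this byte-wise test, which is exactly the table question above.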

        Jakub
