On Thursday, 26 December 2024 13:52:02 CET Walter Lozano wrote:
> I see, thanks for clarifying. I wonder if there is a kind of
> specification which mentions that copyright notice should be utf-8 or if
> it is just the common case.

AFAIK, there's no specification. But sticking to utf-8 makes licensecheck
and cme much more reliable.

Currently, cme receives parsed copyrights and licences from the stdout of
licensecheck, and I cannot detect which part of that output is utf-8 and
which is something else. There's no reliable way to detect this, so I
prefer to push upstream to clean up.

> > I'd suggest you talk with upstream to fix the encoding of the source
> > files.
>
> Thank you, yes, I will discuss this with upstream.

Thanks

> I understand that some of these corner cases are triggered by the new
> features in scan-copyrights which tries to get better scanning results
> which is something I really appreciate.

I'm happy to hear that :-)

> In this context, I wonder if in
> general when trying to parse a copyright notice and some "strange data"
> is found the tool should print a warning and report "UNKNOWN"

cme emits a warning with the unexpected copyright year range and discards
it. Encoding issues would have to be detected by licensecheck, and I don't
think it can do that. Hence, garbage in, garbage out.

All the best
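
P.S. In case it helps the discussion with upstream, here is a minimal
sketch of how the offending source files could be located before the
tree is scanned. It is not part of licensecheck or cme; the function
names (is_valid_utf8, find_non_utf8_files) are made up for this
example. It only answers "does this file decode as strict utf-8?", it
does not guess what the actual encoding is, and it will also flag
binary files.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (plain ASCII passes too)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_files(root: str) -> list[str]:
    """List files under root whose content is not valid UTF-8.

    Hypothetical helper: binary files (images, archives) are reported as
    well, so the result needs a manual look.
    """
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and not is_valid_utf8(path.read_bytes()):
            bad.append(str(path))
    return bad
```

For example, a notice such as "© 2024 José" passes when saved as utf-8
and fails when saved as latin-1, which is the kind of file upstream
would need to convert.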