benhamilton added inline comments.

================
Comment at: clang-tools-extra/trunk/clangd/SourceCode.cpp:38
+  // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+  // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
+  assert((UTF8Length >= 2 && UTF8Length <= 4) &&
----------------
sammccall wrote:
> benhamilton wrote:
> > sammccall wrote:
> > > benhamilton wrote:
> > > > This is user input, right? Have we actually checked for valid UTF-8, or do we just assume it's valid?
> > > >
> > > > If not, it seems like an assertion is not the right check, but we should reject it when we're reading the input.
> > > Yeah, I wasn't sure about this, offline discussion tentatively concluded we wanted an assert, but I'm happy to switch to something else.
> > >
> > > We don't validate the code on the way in, so strings are "bytes of presumed-UTF8". This is usually not a big pain actually. But we could/should certainly make the JSON parser validate the UTF-8. (If we want to go this route, D45753 should be resolved first).
> > >
> > > There's two ways the assertion could fire: the code is invalid UTF-8, or there's a bug in the unicode logic here. I thought the latter was more likely at least in the short-term :) and this is the least invasive way to catch it. And if a developer build (assert-enabled) crashes because an editor feeds it invalid bytes, then that's probably better than doing nothing (though not as good as catching the error earlier).
> > The problem with not validating is it's easy to cause OOB memory access (and thus security issues) if someone crafts malicious UTF-8 and makes us read off the end of a string.
> >
> > We should be clear about the status of all strings in the documentation to APIs.
> You still have to find/write a UTF-8 decoder that doesn't check bounds, which is (hopefully!) the harder part of writing that bug :-)
>
> But I agree in principle, there's more subtle attacks too, like `C08A` which is invalid but non-validating decoders will treat as a newline.
>
> I have a nearly-finished patch to add real validation to the JSON library, I'll copy you on it.
Seems like this particular decoder isn't checking bounds, eh? ;)

If `NDEBUG` is set, it will happily set `UTF8Length` to however many leading 1s there are (valid or not) and pass that to `CB(UTF8Length)`.

It's true that the current callbacks passed in won't directly turn that into an OOB memory access, but they will end up returning an invalid UTF-16 code unit length from `positionToOffset()`, so who knows what that will end up doing.

Thanks, I'm always happy to review Unicode stuff.


Repository:
  rL LLVM

https://reviews.llvm.org/D46035

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits