[PATCH] D138073: [clang-doc] Move file layout to the generators.

Paul Kirth via Phabricator via cfe-commits Fri, 18 Nov 2022 16:32:45 -0800

paulkirth accepted this revision.
paulkirth added a comment.
This revision is now accepted and ready to land.


This is mostly LGTM, modulo the small change to the test that I've suggested. 
Assuming the presubmit tests still pass with that small edit, you're good on my 
end.



================
Comment at: clang-tools-extra/clang-doc/HTMLGenerator.cpp:890-892
+    // TODO If there are multiple Infos for this file name (for example,
+    // template specializations), this will generate multiple complete web 
pages
+    // (with <DOCTYPE> and <title>, etc.) concatenated together. This generator
----------------
brettw wrote:
> paulkirth wrote:
> > brettw wrote:
> > > paulkirth wrote:
> > > > So does this patch make HTML unusable? Currently, if there is a 
> > > > conflict, the file gets replaced, right? 
> > > > 
> > > > I read this as we're going from a situation where we lost data, but 
> > > > still had valid HTML pages to one where we have complete, but incorrect 
> > > > HTML, is that correct?
> > > > 
> > > > Also, why does that happen under template specialization? USRs are SHA1 
> > > > hashes of the symbol. I would expect that hashing the symbol would be 
> > > > unambiguous and unique, given C++'s ODR. 
> > > > 
> > > > That last point made me wonder if an alternate way to handle this 
> > > > without using the hash would be to use the mangled names?
> > > I think this makes HTML strictly better (though not ideal).
> > > 
> > > Previously if there was a collision, they got written over the top of 
> > > each other without truncation (not sure why). So if the longer page was 
> > > written second, you got lucky and you get a complete valid page and just 
> > > lost the shorter struct's definition. If you got unlucky the shorter page 
> > > was written second and you get a corrupted file with the shorter content, 
> > > followed by the end of the longer content peeking out from the end.
> > > 
> > > With this change, you get both content concatenated without corruption, 
> > > you just have duplicate doctype and title tags in between them. Browsers 
> > > should generally handle this so the page should be usable.
> > > 
> > > Fixing this requires a refactor of the HTML generator such that the 
> > > concept of writing a struct is separate from writing a page. Even if I 
> > > was going to work on the HTML generator, I would probably want to do this 
> > > in a separate patch anyway.
> > > 
> > > The markdown generator does the same, but markdown doesn't have all of 
> > > the headers stuff that shouldn't be in the middle of the file, you just 
> > > get a new `# ...` for the second section and it should be reasonable.
> > > 
> > > The HTML and MD generators name files according to the `Name` of the 
> > > structure. USR is not involved. So any time to things in the same 
> > > namespace have the same name, you'll get a collision. This happens most 
> > > commonly with template specialization because the names are identical 
> > > (the name doesn't account for the template parameters).
> > > 
> > > All of this complexity is why I changed the YAML output to use USR which 
> > > avoids the problem. I didn't want to make this change for the HTML and MD 
> > > generators because it will make ugly file names that the user will see, 
> > > and you probably want the template specializations to be in the same file 
> > > anyway.
> > > I think this makes HTML strictly better (though not ideal).
> > > 
> > > Previously if there was a collision, they got written over the top of 
> > > each other without truncation (not sure why). So if the longer page was 
> > > written second, you got lucky and you get a complete valid page and just 
> > > lost the shorter struct's definition. If you got unlucky the shorter page 
> > > was written second and you get a corrupted file with the shorter content, 
> > > followed by the end of the longer content peeking out from the end.
> > > 
> > > With this change, you get both content concatenated without corruption, 
> > > you just have duplicate doctype and title tags in between them. Browsers 
> > > should generally handle this so the page should be usable.
> > > 
> > 
> > Thanks for the explanation. My concern was that this change made HTML 
> > unusable, if it is still valid (if not ideal) then I don't think there is 
> > an issue.
> > 
> > > Fixing this requires a refactor of the HTML generator such that the 
> > > concept of writing a struct is separate from writing a page. Even if I 
> > > was going to work on the HTML generator, I would probably want to do this 
> > > in a separate patch anyway.
> > > 
> > 
> > No arguments re splitting up work. Can you file an issue to track this on 
> > github and reference it in the TODO? That should help make sure this gets 
> > addressed.
> > 
> > > The markdown generator does the same, but markdown doesn't have all of 
> > > the headers stuff that shouldn't be in the middle of the file, you just 
> > > get a new `# ...` for the second section and it should be reasonable.
> > > 
> > > The HTML and MD generators name files according to the `Name` of the 
> > > structure. USR is not involved. So any time to things in the same 
> > > namespace have the same name, you'll get a collision. This happens most 
> > > commonly with template specialization because the names are identical 
> > > (the name doesn't account for the template parameters).
> > > 
> > 
> > I think there should be a way to obtain the mangled name from whatever 
> > you're looking at. Those should be available, even from the AST, so if 
> > you've found a template specialization, it //should// be possible to get 
> > the mangled name (even for partial specializations).  This shouldn't 
> > necessarily block you, but it seems like that would solve the issue nicely 
> > for all three generators.
> TODO added.
> 
> I don't think the mangled names helps both because I think you want the file 
> names to be readable (the mangled names are barely better than the USR) and 
> because I think you also want the specializations to all end up in the same 
> file for end-user documentation. I discussed this in the bug.
> TODO added.
> 
> I don't think the mangled names helps both because I think you want the file 
> names to be readable (the mangled names are barely better than the USR) and 
> because I think you also want the specializations to all end up in the same 
> file for end-user documentation. I discussed this in the bug.

Name mangling less than ideal, but it's an established part of C++. I don't 
know how important it is that file names be completely readable. Lots of web 
URLs are not, and have little impact on users. Mangled names are also commonly 
used in doxygen's HTML generation, so there is a precedent.  

Also, I'm not sure I follow your argument that file names should be readable, 
but you're willing to use a SHA1 as a file name? Mangled names may not be 
completely intuitive, but you can quickly figure out how to read them, where as 
you cannot do so with a SHA.

My questions about this pertain to a more unified solution that could work 
across generators. Do you think that if we just substituted the mangled names, 
we would no longer have the file corruption? From your description of the 
underlying problem, I would assume "yes", but I'd like your opinion. We can 
ignore the question about whether they should be grouped under a single file 
for now, as I view that as a different, but related issue. I haven't' had time 
to read the issue on github yet, but will soon.


================
Comment at: clang-tools-extra/test/clang-doc/single-file-public.cpp:9
+//   for a name with a 40-char USR name.
+// RUN: find %t -regex 
".*[0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z].*"
 -exec cat {} \; | FileCheck %s --check-prefix=CHECK
 // RUN: rm -rf %t
----------------



================
Comment at: clang-tools-extra/test/clang-doc/single-file.cpp:14
 // CHECK: ---
 // CHECK-NEXT: USR:             
'{{[0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]}}'
 // CHECK-NEXT: ChildFunctions:
----------------
I don't expect to fix these, since they're not part of your change, but this is 
a more concise way to do the same check for a USR using the FileCheck regular 
expressions. I doubt you need it, but just in case.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138073/new/

https://reviews.llvm.org/D138073

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D138073: [clang-doc] Move file layout to the generators.

Reply via email to