On 2021-09-19 22:59:31 +0200, Mattia Rizzolo wrote:
> On Sun, Sep 19, 2021 at 09:45:19PM +0200, Vincent Lefevre wrote:
> > On 2021-09-19 19:15:54 +0200, Mattia Rizzolo wrote:
> > > I can never manage to download DTDs from w3.org (how could you?!), so,
> > > taking your testcase and a copy of the same DTD:
> >
> > The DTD is provided by Debian, no need to download it.
>
> But you need to instruct xmllint to use said DTD, it won't by its own
> decision to pick a random DTD from the filesystem.

No, this is not necessary with a correctly configured system. This is
not a random DTD, but the DTD mentioned in the HTML file, which has
the standard public identifier

  "-//W3C//DTD XHTML 1.0 Strict//EN"

Then libxml2 can find the right file on the local file system via
catalogs. In my case (which is the *default* setup with Debian
packages on my system, i.e. I haven't changed anything about that
in /etc):

/etc/xml/catalog contains

  <delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0" catalog="file:///etc/xml/w3c-dtd-xhtml.xml"/>

so that libxml2 then uses /etc/xml/w3c-dtd-xhtml.xml, which contains

  <delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0 Strict//EN" catalog="file:///usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml"/>

so that libxml2 then uses
/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml, which contains

  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN" uri="xhtml1-strict.dtd"/>

so that libxml2 gets the file

  /usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd

There is the same mechanism for the .ent files referenced by
xhtml1-strict.dtd, i.e. via public identifiers.

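To make the delegation chain concrete, here is a minimal Python sketch
of how a public identifier can be followed through delegatePublic
entries down to a public entry. It is only an illustration under
stated assumptions, not libxml2's actual resolver: the three catalogs
are in-memory copies of the entries quoted above, wrapped in a
<catalog> root element with the OASIS catalog namespace (the wrapper
is my assumption), and the names CATALOGS and resolve_public are
hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# OASIS XML Catalogs namespace, in ElementTree's {uri}tag notation.
NS = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}"

# In-memory stand-ins for the three catalog files (assumption: the
# real files contain more entries; only the relevant ones are kept).
CATALOGS = {
    "file:///etc/xml/catalog": """
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0"
                  catalog="file:///etc/xml/w3c-dtd-xhtml.xml"/>
</catalog>""",
    "file:///etc/xml/w3c-dtd-xhtml.xml": """
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0 Strict//EN"
                  catalog="file:///usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml"/>
</catalog>""",
    "file:///usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml": """
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="xhtml1-strict.dtd"/>
</catalog>""",
}


def resolve_public(public_id, catalog_uri, catalogs=CATALOGS):
    """Return the URI mapped to public_id, or None if nothing matches."""
    root = ET.fromstring(catalogs[catalog_uri])

    # 1. An exact <public> match wins; its uri is relative to the catalog.
    for entry in root.iter(NS + "public"):
        if entry.get("publicId") == public_id:
            return urljoin(catalog_uri, entry.get("uri"))

    # 2. Otherwise try the delegated catalogs whose prefix matches,
    #    longest prefix first.
    delegates = [
        e for e in root.iter(NS + "delegatePublic")
        if public_id.startswith(e.get("publicIdStartString"))
    ]
    delegates.sort(key=lambda e: len(e.get("publicIdStartString")), reverse=True)
    for entry in delegates:
        target = urljoin(catalog_uri, entry.get("catalog"))
        if target in catalogs:
            found = resolve_public(public_id, target, catalogs)
            if found is not None:
                return found
    return None


print(resolve_public("-//W3C//DTD XHTML 1.0 Strict//EN", "file:///etc/xml/catalog"))
```

With these three catalogs, the lookup ends at the xhtml1-strict.dtd
file under /usr/share/xml/xhtml/schema/dtd/1.0/, with no network
access involved.
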
> I also know how to
> use apt-file myself:
> | % apt-file search xhtml1-strict.dtd
> | dita-ot: /usr/share/dita-ot/demo/h2d/dtd/xhtml1-strict.dtd
> | erlang-erl-docgen: /usr/lib/erlang/lib/erl_docgen-1.1.1/priv/dtd/xhtml1-strict.dtd
> | kate5-data: /usr/share/katexmltools/xhtml1-strict.dtd.xml
> | libpxp-ocaml-dev: /usr/share/doc/libpxp-ocaml-dev/examples/namespaces/xhtml1-strict.dtd.gz
> | librdf-rdfa-parser-perl: /usr/share/perl5/auto/share/dist/RDF-RDFa-Parser/catalogue/www.w3.org/MarkUp/DTD/xhtml1-strict.dtd
> | w3-recs: /usr/share/doc/w3-recs/html/www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-strict.dtd.gz
> | w3c-sgml-lib: /usr/share/xml/w3c-sgml-lib/schema/dtd/REC-xhtml1-20020801/xhtml1-strict.dtd
> | xemacs21-basesupport: /usr/share/xemacs21/xemacs-packages/etc/psgml-dtds/xhtml1-strict.dtd
> | xmlcopyeditor: /usr/share/xmlcopyeditor/dtd/xhtml1-strict.dtd
> | %
>
> indeed the one I used is the one from xmlcopyeditor (I picked a random
> package, trusting that said .dtd is actually the same as all of the
> above).

The one I'm using is from w3c-dtd-xhtml, apparently no longer
available in Debian (my machine is a Debian/unstable one installed
about 5 years ago, and Debian won't replace the package with
w3c-sgml-lib automatically). In any case, the relevant files from
w3c-sgml-lib seem to be the same, apart from minor differences.

> My system is fine. That error message is only a red herring due to
> --nonet,

Everything is on the local filesystem. There is no reason to do any
network access! If libxml2 tries to access the network, this means
that something on your system is broken... perhaps catalogs that are
not set up correctly.

> and indeed the return code of xmllint is 0.

Don't look at the return code of xmllint; it is not reliable. Even
in case of bad usage, it will sometimes return 0:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=727075

Validation issues are reported on stderr, e.g.
with a working libxml2:

  $ xmllint --loaddtd --nonet --noout test.html
  test.html:6: parser error : EndTag: '</' not found
  ^

> If you prefer, I can modify the DOCTYPE and do this instead, so there
> won't be "I/O error"s and the return code is clear:
>
> mattia@warren /tmp/tmp/xml % cat test.html
> <?xml version="1.0" encoding="utf-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "file:///tmp/tmp/xml/xhtml1-strict.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head><title>title</title></head>
> <body><p>text</p></body>
> </html>
> mattia@warren /tmp/tmp/xml % xmllint --noout --nonet test.html ; echo $?
> 0

Wrong test. You forgot to load the DTD! Please try:

  xmllint --loaddtd --noout --nonet test.html

Note: you may also need to copy the 3 .ent files referenced by the DTD
into the same directory:

  <!ENTITY % HTMLlat1 PUBLIC
     "-//W3C//ENTITIES Latin 1 for XHTML//EN"
     "xhtml-lat1.ent">
  %HTMLlat1;

  <!ENTITY % HTMLsymbol PUBLIC
     "-//W3C//ENTITIES Symbols for XHTML//EN"
     "xhtml-symbol.ent">
  %HTMLsymbol;

  <!ENTITY % HTMLspecial PUBLIC
     "-//W3C//ENTITIES Special for XHTML//EN"
     "xhtml-special.ent">
  %HTMLspecial;

I have tried that:

  $ ls -l /tmp/tmp/xml
  total 68
  -rw-r--r-- 1 vinc17 vinc17 13484 2012-04-24 22:49:16 xhtml-lat1.ent
  -rw-r--r-- 1 vinc17 vinc17  4486 2012-04-24 22:49:16 xhtml-special.ent
  -rw-r--r-- 1 vinc17 vinc17 13748 2012-04-24 22:49:16 xhtml-symbol.ent
  -rw-r--r-- 1 vinc17 vinc17 25473 2012-04-24 22:49:15 xhtml1-strict.dtd

With libxml2 2.9.10+dfsg-6.7, strace shows that every file is loaded
from this directory, and I get no output, as expected. But with
libxml2 2.9.12+dfsg-4, I get:

  $ xmllint --loaddtd --noout --nonet test.html
  error : xmlAddEntity: invalid redeclaration of predefined entity
  error : xmlAddEntity: invalid redeclaration of predefined entity

and strace still shows that every file is loaded from this directory.

Something interesting:

openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-lat1.ent", O_RDONLY) = 5
lseek(5, 0, SEEK_CUR) = 0
read(5, "<!-- ..........................."..., 8192) = 8192
read(5, " \"&#215;\" ><!-- multiplication "..., 16384) = 5292
read(5, "", 11092) = 0
brk(0x559087649000) = 0x559087649000
close(5) = 0
[...]
openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-symbol.ent", O_RDONLY) = 5
lseek(5, 0, SEEK_CUR) = 0
read(5, "<!-- ..........................."..., 8192) = 8192
read(5, " rArr can be used for 'impli"..., 16384) = 5556
read(5, "", 10828) = 0
close(5) = 0
[...]
openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-special.ent", O_RDONLY) = 5
lseek(5, 0, SEEK_CUR) = 0
read(5, "<!-- ..........................."..., 8192) = 4486
read(5, "", 3706) = 0
write(2, "error : ", 8) = 8
write(2, "xmlAddEntity: invalid redeclarat"..., 57) = 57
write(2, "error : ", 8) = 8
write(2, "xmlAddEntity: invalid redeclarat"..., 57) = 57
close(5) = 0

So the issue seems to occur when reading xhtml-special.ent.

Hmm... there seems to be a subtle difference in xhtml-special.ent:

With the file from w3c-dtd-xhtml:

<!ENTITY quot "&#34;" ><!-- quotation mark = APL quote, U+0022 ISOnum -->
<!ENTITY amp "&#38;" ><!-- ampersand, U+0026 ISOnum -->
<!ENTITY lt "&#60;" ><!-- less-than sign, U+003C ISOnum -->
<!ENTITY gt "&#62;" ><!-- greater-than sign, U+003E ISOnum -->

But with the file from w3c-sgml-lib:

<!ENTITY lt "&#38;#60;" ><!-- less-than sign, U+003C ISOnum -->
<!ENTITY gt "&#62;" ><!-- greater-than sign, U+003E ISOnum -->
<!ENTITY amp "&#38;#38;" ><!-- ampersand, U+0026 ISOnum -->
<!ENTITY apos "&#39;" ><!-- The Apostrophe (Apostrophe Quote, APL Quote), U+0027 ISOnum -->
<!ENTITY quot "&#34;" ><!-- quotation mark (Quote Double), U+0022 ISOnum -->

The errors correspond to amp and lt.

Now, I don't know whether the new libxml2 version is too picky, or
whether there was a real issue with the old entity files (ignored by
all parsers until now?). In the latter case, I think that there should
be a Breaks against w3c-dtd-xhtml.

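For reference, XML 1.0 (section 4.6, "Predefined Entities") says that
if lt or amp are declared, their replacement text must be a character
reference to the escaped character, i.e. they must be double-escaped,
while double escaping is optional for gt, apos and quot. That would be
consistent with getting exactly two errors. Below is a rough Python
sketch of such a check; it is not libxml2's code, the regex-based
parsing is a simplification, and the names bad_predefined, OLD and NEW
are hypothetical. The two sample strings are modelled on the
declarations from the two packages, written with numeric character
references.

```python
import re

# Simplified matcher for internal general entity declarations
# (assumption: double-quoted literal values, as in the .ent files).
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>')
CHAR_REF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')

# Entities for which XML 1.0 section 4.6 requires double escaping.
MUST_DOUBLE_ESCAPE = {"lt": "<", "amp": "&"}


def expand_char_refs(text):
    """Expand character references once, without rescanning the result."""
    def repl(match):
        hexa, dec = match.groups()
        return chr(int(hexa, 16) if hexa else int(dec))
    return CHAR_REF.sub(repl, text)


def bad_predefined(dtd_text):
    """Return names of lt/amp declarations that are not double-escaped."""
    bad = []
    for name, literal in ENTITY_DECL.findall(dtd_text):
        if name not in MUST_DOUBLE_ESCAPE:
            continue
        # Replacement text = literal value with char refs expanded once.
        replacement = expand_char_refs(literal)
        # It must itself be a char ref to the character being escaped.
        ok = (CHAR_REF.fullmatch(replacement) is not None
              and expand_char_refs(replacement) == MUST_DOUBLE_ESCAPE[name])
        if not ok:
            bad.append(name)
    return bad


OLD = r'''
<!ENTITY quot "&#34;" >
<!ENTITY amp "&#38;" >
<!ENTITY lt "&#60;" >
<!ENTITY gt "&#62;" >
'''

NEW = r'''
<!ENTITY lt "&#38;#60;" >
<!ENTITY gt "&#62;" >
<!ENTITY amp "&#38;#38;" >
<!ENTITY apos "&#39;" >
<!ENTITY quot "&#34;" >
'''

print(bad_predefined(OLD))
print(bad_predefined(NEW))
```

On these samples, the single-escaped declarations of amp and lt are
flagged and the double-escaped ones are not.
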
One more thing: I've just checked on my Debian/stable machine, which
just has w3c-sgml-lib installed: "xmllint --loaddtd --nonet --noout"
works without any error. Thus there should be no issue in switching
from w3c-dtd-xhtml to w3c-sgml-lib.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)