On 26/04/2023 21:33, Albretch Mueller wrote:
> a) the crazy long name
> b) its base64 representation
> c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.
I see no point in the base64 step, since the SHA-256 hash may be calculated
for the original URI directly. However, an important step, URI
normalization, is missing:
- often http: and https: are alternatives
- a domain name may contain Unicode characters or be represented as pure
ASCII Punycode
- #anchors (sometimes an empty #) at the end of a URI usually do not change
the served content. Some web applications abuse them, however, to
provide content that depends on the anchor. Or a web page may hide parts of
its content using CSS depending on the anchor. So stripping them may
cause trouble.
- Session or user activity tracking query ("search") parameters
must be stripped for archival purposes
- Some parts of a URI may be percent-encoded while the URI remains
equivalent to the "canonical" one
- A web page may suggest a "canonical" URL, but sometimes it is a misleading
hint.
So URI comparison is not a trivial task.
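
To make the list above a bit more concrete, here is a rough sketch in
Python of what such a normalization step might look like before hashing.
It is not a complete or authoritative implementation. The names
normalize_uri, uri_key and TRACKING_PARAMS are made up for illustration,
the list of tracking parameters is only an example, and treating http:
and https: as the same resource is an assumption that will not hold for
every site. User info in the authority part is dropped.

```python
import hashlib
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Example list only (assumption); a real archive would need a longer one.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

# Characters that RFC 3986 calls "unreserved": safe to percent-decode.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}


def _normalize_percent(text):
    """Decode %XX only for unreserved characters, uppercase the rest."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize_uri(uri, force_https=True, strip_fragment=True):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()

    # Unicode host name -> ASCII Punycode, lower case.
    host = parts.hostname or ""
    try:
        host = host.encode("idna").decode("ascii")
    except UnicodeError:
        pass  # leave the host untouched if it cannot be encoded
    host = host.lower()

    # Keep the port only when it is not the default for the original scheme.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # Assumption: http: and https: serve the same content.
    if force_https and scheme == "http":
        scheme = "https"

    path = _normalize_percent(parts.path) or "/"

    # Drop tracking parameters without re-encoding the remaining ones.
    pairs = [p for p in parts.query.split("&")
             if p and p.split("=", 1)[0] not in TRACKING_PARAMS]
    query = _normalize_percent("&".join(pairs))

    # Stripping the anchor is optional because some pages depend on it.
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((scheme, netloc, path, query, fragment))


def uri_key(uri):
    """SHA-256 of the normalized URI, computed directly, no base64 step."""
    return hashlib.sha256(normalize_uri(uri).encode("utf-8")).hexdigest()
```

With this sketch, uri_key("http://EXAMPLE.com/a?fbclid=zzz") and
uri_key("https://example.com/a") give the same value. A "canonical" link
suggested by the page itself is deliberately ignored here.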
Another point is that the same page may be saved multiple times, so the
URI hash alone is not enough for a unique key.
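
One possible way around that, sketched in Python, is to combine the URI
hash with the time of the download. The function name snapshot_key and
the key layout are only assumptions for illustration. A hash of the
saved content could be used instead of, or in addition to, the
timestamp.

```python
import hashlib
from datetime import datetime, timezone


def snapshot_key(uri, fetched_at=None):
    """Return '<sha256 of URI>-<UTC timestamp>' for one saved copy."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    # A naive datetime is interpreted as local time by astimezone().
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    return f"{digest}-{stamp}"
```

Two downloads of the same URI then share the hash prefix, so they are
easy to group, while the suffix keeps the keys distinct as long as the
downloads are at least a second apart.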
On 26/04/2023 21:48, Nicolas George wrote:
> OTOH, HTTP does have a place to state the type of the file, and the
> extension in URLs is not reliable: if you want to do it properly, you
> must set your local file extension based on the Content-Type response
> header.
And you will quickly face servers that send an incorrect Content-Type or
intentionally use application/octet-stream with a nosniff header to force
the browser to save the file instead of opening it, e.g. in the built-in
PDF reader.
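
A defensive approach, sketched below in Python, is to look at the first
bytes of the body before trusting the header. The function name
pick_extension and both lookup tables are illustrative assumptions and
cover only a handful of types. A real tool would need a much longer
list or a library such as libmagic.

```python
# Well-known file signatures ("magic bytes"); deliberately incomplete.
MAGIC = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"PK\x03\x04", ".zip"),
]

# Small explicit mapping from media type to extension (assumption).
TYPES = {
    "application/pdf": ".pdf",
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "text/html": ".html",
}


def pick_extension(content_type, first_bytes, default=".bin"):
    """Guess a local file extension from body bytes, then from the header."""
    # 1. The body itself is the most reliable evidence.
    for signature, extension in MAGIC:
        if first_bytes.startswith(signature):
            return extension
    # 2. Fall back to Content-Type, ignoring parameters such as charset.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime in TYPES:
        return TYPES[mime]
    # 3. Unknown or generic type such as application/octet-stream.
    return default
```

For a PDF served as application/octet-stream,
pick_extension("application/octet-stream", b"%PDF-1.7\n") returns ".pdf",
while a header with no recognizable body falls through to the mapping or
to the default.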