Re: text/markdown effort in IETF (invite)

Michel Fortin Wed, 09 Jul 2014 15:08:16 -0700

Le 9-juil.-2014 à 16:08, Sean Leonard <[email protected]> a écrit :


> The operating question is: What metadata (companion data) is /necessary/ to 
> reflect the creator's intent with respect to the data?
> 
> With Markdown, I think the answer is: you need the character set, and you 
> need to know how to turn the text into HTML (or XHTML, PDF, RTF, MS 
> Word/Office Open XML, or whatever).

Indeed.


> Markdown has no way to communicate the character set in the document (other 
> than the Unicode Byte Order Marks, which is a generalized property about text 
> streams, not specific to Markdown)--and it would be counterproductive to 
> invent one. So that is a perfect example of relevant metadata.

Fun fact: PHP Markdown is mostly encoding agnostic. It understands UTF-8 
sequences but any byte that is not a valid UTF-8 sequence is treated as a 
character in itself. It's only relevant when converting tabs into spaces 
however, and only if you have non-ASCII characters before the tab.

So whatever the input encoding is becomes the output's encoding (this works for 
HTML). Naturally, it's good to know the input's encoding if you want to know 
the output's. So obviously it's a good idea to specify the text encoding even 
though the parser itself doesn't need it, so you know the resulting document's 
encoding.

That's not really relevant though.


> And the second one, is how to turn it into something else that the author 
> wants. If it's not communicated, it's going to be implied. Implied means 
> "guessing" and likely "guessing wrong".

Ideally you'd use the exact same version of the same parser the author used to 
interpret the document in the first place.

Or you could be loose and use another version of the same parser.
Or you could be loose and use another parser claiming to be of the same flavor.
Or you could be loose and use another parser claiming to be of a superset of 
the given flavor.
Or you could be loose and use another Markdown parser.

It's a spectrum. Each step down will increase the likeliness of something going 
wrong.


> Hopefully this makes sense. I want to be more educated about this.

This makes perfect sense, but I fear there's no good answer to your second 
question. Since you want to know more, here's some insight.

It's important to understand that there is no notion of invalid Markdown input. 
As an implementer every time you fix what looks like a parsing bug to you or 
add a feature you're also breaking some valid input that was producing 
something else before. The implementer will usually only choose to break valid 
input that was deemed very unlikely to ever have been used before, but there's 
no way to know for sure (and no reliable way to measure impact either). So if 
you really really want to be sure things are parsed in the intended way, you 
should use the closest version possible of the same parser as the creator of 
the document was using.

Also, subtle changes can make things technically incompatible. For instance, 
Markdown Extra is mostly a superset of the original Markdown feature-wise, 
except for one small incompatible change: underscore emphasis within a word is 
disallowed. This was a deliberate change to fix some problems users were having 
with words that contained underscore. So even though most people would consider 
Markdown Extra as a superset of Markdown, it technically isn't. Other 
implementers might do the same thing but consider it as a bug fix instead and 
tell their users implementation implements the original syntax.

Babelmark 2 will tell you that implementations are pretty much evenly split on 
this:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=word_with_emphasis

You'll even see that Pandoc implements both behaviour depending on whether 
you're in strict mode or not.

Something stranger happens with the shortcut reference syntax:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%5Blink%3F%5D%0A%0A%5Blink%3F%5D%3A+http%3A%2F%2Flink.x%2F

It's pretty much universally supported. It comes from a Markdown.pl beta that 
was never formally released but which is widely in use. If you were to go to 
the Markdown website and use the download link, you won't get the beta and it 
won't work. And while the second one is feature-wise a superset of the first, 
technically it could in some rare situations break documents, turning square 
bracketed text into links where it shouldn't:

        Someone on [street Ivanhoe Carol][sIC] told me this:

        > This is bad [sic].

        [sIC]: http://sic.sickdomain

I sure wish things would be simpler. But as things are now, I have a hard time 
identifying what "flavor" could mean. Should "Markdown.pl-1.0.1" be a flavor on 
its own?


-- 
Michel Fortin
[email protected]
http://michelf.ca

_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss

Re: text/markdown effort in IETF (invite)

Reply via email to