Discussion:
text/markdown effort in IETF (invite)
Sean Leonard
2014-07-09 15:49:21 UTC
Permalink
Hi markdown-discuss Folks:

I am working on a Markdown effort in the Internet Engineering Task
Force, to standardize on "text/markdown" as the Internet media type for
all variations of Markdown content. You can read my draft here:
<http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.

The proposal is already getting traction. Is there anyone on this list
that is interested in participating or helping this effort? In
particular we need to better understand and document what versions of
Markdown exist, so that either Markdown as a family of informal syntaxes
will start to converge, or if not, that Markdown variations have an easy
way to be distinguished from one another. (See the "flavor" parameter
discussed in the draft.)

The draft is currently being discussed on apps-***@ietf.org.

Kind regards,

Sean Leonard
Author of Markdown IETF Draft
Fletcher T. Penney
2014-07-09 16:07:47 UTC
Permalink
I would strongly recommend thinking through some of the "political"
decisions before getting too far into this.

1) "Markdown" officially refers to the implementation and syntax created
by John Gruber.

2) "Markdown" the perl implementation has not seen a bug fix in nearly
10 years.

3) Gruber's voice has been noticeably absent from the list for a long
time, except for a comment that I recall as basically saying that
Markdown was essentially feature complete as far as he was concerned.

4) Gruber has specifically said in the past that new projects could not
coopt the "Markdown" name and would have to be clearly disambiguated.
For example, I would assume that anyone other than Gruber could not
create "Markdown 2.0" to be the Markdown to rule them all...

5) I don't have numbers to back this up, but would strongly suspect that
at this point very few people who think they use "Markdown" actually
are. Most are using various derivatives that have made wide-ranging
decisions on how to handle edge cases, etc. For most users, whose needs
are very basic, the distinction is probably academic. But I would
suggest that these distinctions are very important when it comes to
official standards.


I would propose that if there is to be an official standard based on
"Markdown", it would first require defining what "Markdown" is. To do
that would (hopefully) require a more formalized description of the
grammar. If Gruber were to sign off on allowing this to use the
"Markdown" name, fantastic. But if not, a difficult decision would need
to be made:

1) Build a standard based on Markdown.pl, bugs and all, and keep the
"Markdown" name.

2) Develop a formalized version of the core syntax of Markdown, and base
the standard on this. Unless it were to receive Gruber's blessing, it
would have to be named something other than Markdown.

3) Continue to use the term "Markdown" as a vague term that refers to a
loosely related collection of tools, leaving users to wonder why a given
document works with one tool, and not others. At some point, a new
common standard (e.g. "Son of Markdown" or whatever) may or may not
arise that would require redefining all of this stuff. Granted, efforts
to organize such a standard have thus far failed despite multiple
enthusiastic discussions over the years on this list.



My $.02....


FTP
Post by Sean Leonard
I am working on a Markdown effort in the Internet Engineering Task
Force, to standardize on "text/markdown" as the Internet media type for
<http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.
The proposal is already getting traction. Is there anyone on this list
that is interested in participating or helping this effort? In
particular we need to better understand and document what versions of
Markdown exist, so that either Markdown as a family of informal syntaxes
will start to converge, or if not, that Markdown variations have an easy
way to be distinguished from one another. (See the "flavor" parameter
discussed in the draft.)
Kind regards,
Sean Leonard
Author of Markdown IETF Draft
_______________________________________________
Markdown-Discuss mailing list
http://six.pairlist.net/mailman/listinfo/markdown-discuss
--
Fletcher T. Penney
***@fletcherpenney.net
Dennis E. Hamilton
2014-07-09 16:58:04 UTC
Permalink
I think the Internet draft is very clear. It is not a Standards Track project. It is a MIME-type registration proposal and the procedure for determination of flavors should satisfy whatever concerns there are. In general, a MIME-type registration has to point to some place where there is a description of the format. These are not particularly definitive or authoritative in some cases, and this registration could fail for lack of something definitive. That is best dealt with on the IETF discussion list.

I have nothing to offer concerning "official" Markdown. It would appear that the term has already been appropriated as a common noun and there is no means to protect against that being otherwise.


-- Dennis E. Hamilton
***@acm.org +1-206-779-9430
https://keybase.io/orcmid PGP F96E 89FF D456 628A



-----Original Message-----
From: Markdown-Discuss [mailto:markdown-discuss-***@six.pairlist.net] On Behalf Of Fletcher T. Penney
Sent: Wednesday, July 9, 2014 09:08
To: Discussion related to Markdown.
Subject: Re: text/markdown effort in IETF (invite)

[ ... ]
Sean Leonard
2014-07-09 19:06:35 UTC
Permalink
Hello everyone,
Post by Dennis E. Hamilton
I think the Internet draft is very clear. It is not a Standards Track project. It is a MIME-type registration proposal and the procedure for determination of flavors should satisfy whatever concerns there are.
That is correct. My purpose in creating this first draft is not to make
a Markdown standard; it is to identify Markdown content in a
standardized way, namely, with text/markdown.
Post by Dennis E. Hamilton
In general, a MIME-type registration has to point to some place where there is a description of the format. These are not particularly definitive or authoritative in some cases, and this registration could fail for lack of something definitive. That is best dealt with on the IETF discussion list.
Yes, this is a sticking point. Experienced IETFers will raise (and have
raised) concerns about the authoritativeness of the format. But IETFers
have less experience with Markdown compared to you all, which is why I'm
bringing it up here (and elsewhere).
Post by Dennis E. Hamilton
I have nothing to offer concerning "official" Markdown. It would appear that the term has already been appropriated as a common noun and there is no means to protect against that being otherwise.
I am of the same view. Anyone can call anything "Markdown"--no one is
stopping them. Just as anyone can call anything "ASCII art" or "mashups"
(i.e., there might be an ASCII standard but what people do with it is
totally different--it has become a cultural phenomenon). In the draft, I
restricted the eligible formats to "things based on John Gruber's
original Markdown tool and syntax from 2004".

Some realities are apparent, at least to me:
1. Markdown is a real thing. It's not plain text and it's not HTML--it's
something different. (Heck, this list could be Markdown!)
2. People are using Markdown for real things of economic and social value.
3. Markdown is different from other _lightweight markup languages_.
I.e., it's not reStructuredText, BBCode, javadocs, or Creole (wiki
markup). But unlike the aforementioned examples, there is no authority
that guides its development. (reStructuredText is a Python thing, for
example, so the Python people are in charge of it.)
4. Things that are called Markdown (MultiMarkdown, GitHub Flavored
Markdown, etc.) share more in common with each other than those in
#3--therefore these things are related.
5. People are storing and exchanging Markdown-as-Markdown between
systems. Not Markdown-as-plain-text, and not Markdown-as-HTML. Thus,
there is a need for standardized interchange.

-Sean
Fletcher T. Penney
2014-07-09 19:18:36 UTC
Permalink
I disagree with the section quoted below.

To my knowledge, Gruber has not officially trademarked "Markdown".

Markdown was a word before Gruber used it, but for different contexts.

I am not a lawyer.

However, in the world of honest people, the word "Markdown" as applied
to lightweight text formats belongs to Gruber. Others may play off of
it (PHP Markdown Extra, my own MultiMarkdown, etc.), but I can't create
an entirely new syntax and call it Markdown.

FTP
Post by Sean Leonard
I am of the same view. Anyone can call anything "Markdown"--no one is
stopping them. Just as anyone can call anything "ASCII art" or "mashups"
(i.e., there might be an ASCII standard but what people do with it is
totally different--it has become a cultural phenomenon). In the draft, I
restricted the eligible formats to "things based on John Gruber's
original Markdown tool and syntax from 2004".
--
Fletcher T. Penney
***@fletcherpenney.net
Jason Davies
2014-07-09 22:18:02 UTC
Permalink
While I don't disagree with these points, I don't think they are
necessarily *the* point.

Markdown is -- sometimes -- used as Markdown, by which I mean I read it
raw and send it to people raw. But the vast majority of the time, it's a
lightweight mark-up language and -- most importantly -- a transitional
mode. It becomes something else (HTML, TeX, OPML, etc.).

Its virtues are simplicity and adaptability. So, for instance, in
Mailmate I can write in markdown and it will be interpreted into HTML
rich text. But for that very reason, it doesn't implement numbered lists
(the developer explained to me that this became problematic when you
convert *back*, in a reply, which Mailmate has a good stab at).

In other words, there are going to be cases where someone implements it
differently, for valid reasons (which you may or may not agree with). So
you could say it's not true Markdown (well, it's not, if you go by
Gruber's syntax). But its simplicity and growing popularity mean that
it's too tempting to use: otherwise the developer would have to invent a
parallel beta-code with a different syntax, which is as frustrating as
the way that different wikis use different mark-up (drives me nuts... I
can never remember the different dialects).

So if you created a 'standard markdown', as HTML 5 did, you would also
have a bunch of people who wouldn't implement it fully. HTML 5 was made
mission critical by two things, it seems to me: 1) Microsoft's
deliberate attempts to break HTML's universal rendering forced the
community to unite and sort it out, and 2) the fact that massive
commercial and social implications arose from websites not working
properly. If IE had not been such a pain to code for, and the
consequences of the minor variations had not been great, you'd never
have had HTML 5 -- there would not have been the will.

Markdown does not currently have that scenario. Its greatest asset is
its relatively low-level specification and elegance. So, for instance,
if you want to convert it to LaTeX or OPML, you can, through
multimarkdown. If your website, written in markdown, doesn't work
properly, you just fix it, because you have a standard to fix it against
(thanks to HTML 5).

So, without a strong impetus to enforce co-operation, if you created a
new standard, you would *still* end up with one 'true' markdown and
several variants which people would implement to suit their purposes --
which is precisely what you have now. There is a near-perfect
specification, and there are variants.



There is no such urgency in this case: enforcement is therefore going to
be a matter of voluntary adherence to a single spec (or not). In other
words, we are already wherever we are going to end up, with a few details
changed.
Post by Sean Leonard
1. Markdown is a real thing. It's not plain text and it's not
HTML--it's something different. (Heck, this list could be Markdown!)
2. People are using Markdown for real things of economic and social value.
3. Markdown is different from other _lightweight markup languages_.
I.e., it's not reStructuredText, BBCode, javadocs, or Creole (wiki
markup). But unlike the aforementioned examples, there is no authority
that guides its development. (reStructuredText is a Python thing, for
example, so the Python people are in charge of it.)
4. Things that are called Markdown (MultiMarkdown, GitHub Flavored
Markdown, etc.) share more in common with each other than those in
#3--therefore these things are related.
5. People are storing and exchanging Markdown-as-Markdown between
systems. Not Markdown-as-plain-text, and not Markdown-as-HTML. Thus,
there is a need for standardized interchange.
Jason Davies
2014-07-09 17:48:27 UTC
Permalink
Unless it were to receive Gruber's blessing, it would have to be named
something other than Markdown.
Really good summary, Fletcher. I think unless someone steps up to create
Son of Markdown as a project, we should all live with your third option
(and I am just an inexpert user who would not be able to contribute any
code/coding logic, so I'm not volunteering).

We could go classical for the name: the Latin would be 'subscribe', which
is not very helpful. The (very bad, literal) classical Greek for 'Son of
Markdown' would be something like 'Hypographides', retaining the allusion
to mark-down without actually using Gruber's name.

Does the elusive acronym 'HGP' appeal to anyone? ;)
Michel Fortin
2014-07-09 19:06:20 UTC
Permalink
I am working on a Markdown effort in the Internet Engineering Task Force, to standardize on "text/markdown" as the Internet media type for all variations of Markdown content. You can read my draft here: <http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.
The proposal is already getting traction. Is there anyone on this list that is interested in participating or helping this effort? In particular we need to better understand and document what versions of Markdown exist, so that either Markdown as a family of informal syntaxes will start to converge, or if not, that Markdown variations have an easy way to be distinguished from one another. (See the "flavor" parameter discussed in the draft.)
The "flavor" parameter is a good idea in theory. I'm not sure it'll be very useful in general though. Nobody is going to annotate their file with the right flavor unless there's a tangible benefit, and I don't see what the benefit could be. Software that could do something useful with markdown-identified content will likely ignore the flavor part when parsing because no one wants to see "incompatible flavor" errors, especially when commonly used parts of the syntax are compatible anyway.

Markdown is in the spot where HTML was before HTML5 with each implementation doing its own thing. I don't know if Markdown will get out of there anytime soon. I'll point out however that HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
--
Michel Fortin
***@michelf.ca
http://michelf.ca
Dennis E. Hamilton
2014-07-09 19:42:41 UTC
Permalink
"Flavor" was handled in HTML with the DTD, FWIW.

-----Original Message-----
From: Markdown-Discuss [mailto:markdown-discuss-***@six.pairlist.net] On Behalf Of Michel Fortin
Sent: Wednesday, July 9, 2014 12:06
To: Discussion related to Markdown.
Subject: Re: text/markdown effort in IETF (invite)

[ ... ]
Markdown is in the spot where HTML was before HTML5 with each implementation doing its own thing. I don't know if Markdown will get out of there anytime soon. I'll point out however that HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
--
Michel Fortin
***@michelf.ca
http://michelf.ca
Sean Leonard
2014-07-09 20:00:21 UTC
Permalink
Post by Michel Fortin
I am working on a Markdown effort in the Internet Engineering Task Force, to standardize on "text/markdown" as the Internet media type for all variations of Markdown content. You can read my draft here: <http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.
The proposal is already getting traction. Is there anyone on this list that is interested in participating or helping this effort? In particular we need to better understand and document what versions of Markdown exist, so that either Markdown as a family of informal syntaxes will start to converge, or if not, that Markdown variations have an easy way to be distinguished from one another. (See the "flavor" parameter discussed in the draft.)
The "flavor" parameter is a good idea in theory. I'm not sure it'll be very useful in general though. Nobody is going to annotate their file with the right flavor unless there's a tangible benefit, and I don't see what the benefit could be. Software that could do something useful with markdown-identified content will likely ignore the flavor part when parsing because no one wants to see "incompatible flavor" errors, especially when commonly used parts of the syntax are compatible anyway.
Markdown is in the spot where HTML was before HTML5 with each implementation doing its own thing. I don't know if Markdown will get out of there anytime soon. I'll point out however that HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
Ok so here is where I really want to focus and learn some stuff from the
Markdown community. I am a fairly heavy Markdown user, but not a
Markdown developer or maintainer [yet].
Waylan Limberg
2014-07-09 20:54:27 UTC
Permalink
I think this comment [1] by Gruber on the mailing list in the past can shed
some light on what the spec is (is it markdown.pl, the syntax rules on
daringfireball.net, some mashup of various implementations, or something
else?). According to Gruber, it is the syntax rules and that's it. If that
is not "good enough" to get a mime-type, then I don’t think there is
anything else we can do.

[1]:
http://six.pairlist.net/pipermail/markdown-discuss/2008-February/001001.html

-----Original Message-----
From: Markdown-Discuss [mailto:markdown-discuss-***@six.pairlist.net] On
Behalf Of Sean Leonard
Sent: Wednesday, July 09, 2014 4:00 PM
To: markdown-***@six.pairlist.net
Subject: Re: text/markdown effort in IETF (invite)
[ ... ]
Sean Leonard
2014-07-09 20:08:41 UTC
Permalink
The "flavor" parameter is a good idea in theory. [...] Nobody is going to annotate their file with the right flavor unless there's a tangible benefit[...]
[...] HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
About this "flavors" thing. I know there are several lists floating out
there of different Markdown implementations and variants (or if you
don't like them being called Markdown, you can call them Illegitimate
Sons of Markdown™). Which list is the most complete? Can someone show me
(or make for the community) a really comprehensive list, and agree to
update it?

When I wrote the -00 draft, I tried to follow the Media Type
Registration Procedures. One requirement is to list required and
optional parameters. Parameters are defined in RFC 6838 as "companion
data". See RFC 6838 and in particular, Sections 1, 4.2.1, and 4.3.

All text/ types have at least one parameter: the charset. That is
because all text data has to be interpreted according to a code (i.e.,
character set) that converts the bits of data into useful information.
Nowadays we take Unicode (specifically UTF-8) for granted, but it's just
not the case in reality. You can't just open a text file and hope for
the best--you have to have /metadata/, express or implied, that tells
you how to handle the blob of bits. The very fact that it is textual
data has to be inferred from other things, such as the filename
extension (when the data is in a file). A filename is just another piece
of metadata.

When dealing with HTML, the charset could be determined in at least six ways:
1. as express external metadata, when the Content-Type has a charset
parameter in the HTTP header.
2. as implied external metadata, when the HTTP header is absent but the
client infers it from "other things" (e.g., the server, the IP address,
or by looking at the ccTLD).
3. as express internal "metadata", with <meta charset="iso-2022-jp"> or
<meta http-equiv="Content-Type" content="text/html;
charset=iso-2022-jp">; or in the case of XHTML, <?xml version="1.0"
encoding="iso-2022-jp"?>.
4. as express internal *data*, that is, the first bytes are 0xFF 0xFE
(likely UTF-16LE), 0xFE 0xFF (likely UTF-16BE), or 0xEF 0xBB 0xBF
(likely UTF-8).
5. as implied internal *data*, that is, "take the first 256 bytes and
try to see if it decodes to something approximating HTML soup using some
common character sets; if it fits, you quit".
6. as express user preference, that is, "I'm Japanese in Japan on a
Windows machine, therefore on my browser, just assume everything is
Shift-JIS".


See...there are all these crazy options...because nobody standardized on
the character set when HTTP/HTML was developed; people assumed it was
US-ASCII and then shoehorned lots of zany ways to make it something else.
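
To make option 4 concrete, here is a minimal sniffing sketch (Python,
purely as an illustration--nothing here is mandated by the draft):

def sniff_bom(data):
    # Option 4: express internal *data*, i.e. a Unicode byte order mark.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"   # likely
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"   # likely
    return None  # fall back to external metadata or guessing (options 1-2, 5-6)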

At least with Markdown, we can probably safely eliminate #3 since
Markdown is not intended to generate the <head> part of (X)HTML.

The operating question is: What metadata (companion data) is /necessary/
to reflect the creator's intent with respect to the data?

With Markdown, I think the answer is: you need the character set, and
you need to know how to turn the text into HTML (or XHTML, PDF, RTF, MS
Word/Office Open XML, or whatever).

Markdown has no way to communicate the character set in the document
(other than the Unicode Byte Order Marks, which is a generalized
property about text streams, not specific to Markdown)--and it would be
counterproductive to invent one. So that is a perfect example of
relevant metadata. And the second one, is how to turn it into something
else that the author wants. If it's not communicated, it's going to be
implied. Implied means "guessing" and likely "guessing wrong".
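
Concretely, the kind of express external metadata I have in mind would
look something like this (the flavor value below is only an illustration;
the exact tokens are part of what needs to be worked out):

Content-Type: text/markdown; charset=UTF-8; flavor=multimarkdown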

Hopefully this makes sense. I want to be more educated about this. Thanks!

Sean
Michel Fortin
2014-07-09 22:07:19 UTC
Permalink
The operating question is: What metadata (companion data) is /necessary/ to reflect the creator's intent with respect to the data?
With Markdown, I think the answer is: you need the character set, and you need to know how to turn the text into HTML (or XHTML, PDF, RTF, MS Word/Office Open XML, or whatever).
Indeed.
Markdown has no way to communicate the character set in the document (other than the Unicode Byte Order Marks, which is a generalized property about text streams, not specific to Markdown)--and it would be counterproductive to invent one. So that is a perfect example of relevant metadata.
Fun fact: PHP Markdown is mostly encoding agnostic. It understands UTF-8 sequences but any byte that is not a valid UTF-8 sequence is treated as a character in itself. It's only relevant when converting tabs into spaces however, and only if you have non-ASCII characters before the tab.
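
A rough sketch of why that matters (Python here, just to show the column
arithmetic; this is not how PHP Markdown is written):

# 'é' is one column but two bytes in UTF-8, so a character-counting pass
# and a byte-counting pass expand the same tab to different widths.
line = "é\tx"
print(line.expandtabs(4))                  # 'é' counts as 1 column -> tab becomes 3 spaces
print(line.encode("utf-8").expandtabs(4))  # 'é' counts as 2 bytes  -> tab becomes 2 spaces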

So whatever the input encoding is becomes the output's encoding (this works for HTML). Naturally, it's still a good idea to specify the text encoding even though the parser itself doesn't need it, so that you know the resulting document's encoding.

That's not really relevant though.
And the second one, is how to turn it into something else that the author wants. If it's not communicated, it's going to be implied. Implied means "guessing" and likely "guessing wrong".
Ideally you'd use the exact same version of the same parser the author used to interpret the document in the first place.

Or you could be loose and use another version of the same parser.
Or you could be loose and use another parser claiming to be of the same flavor.
Or you could be loose and use another parser claiming to be of a superset of the given flavor.
Or you could be loose and use another Markdown parser.

It's a spectrum. Each step down will increase the likelihood of something going wrong.
Hopefully this makes sense. I want to be more educated about this.
This makes perfect sense, but I fear there's no good answer to your second question. Since you want to know more, here's some insight.

It's important to understand that there is no notion of invalid Markdown input. As an implementer every time you fix what looks like a parsing bug to you or add a feature you're also breaking some valid input that was producing something else before. The implementer will usually only choose to break valid input that was deemed very unlikely to ever have been used before, but there's no way to know for sure (and no reliable way to measure impact either). So if you really really want to be sure things are parsed in the intended way, you should use the closest version possible of the same parser as the creator of the document was using.

Also, subtle changes can make things technically incompatible. For instance, Markdown Extra is mostly a superset of the original Markdown feature-wise, except for one small incompatible change: underscore emphasis within a word is disallowed. This was a deliberate change to fix some problems users were having with words that contained underscores. So even though most people would consider Markdown Extra a superset of Markdown, it technically isn't. Other implementers might do the same thing but consider it a bug fix instead and tell their users their implementation implements the original syntax.

Babelmark 2 will tell you that implementations are pretty much evenly split on this:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=word_with_emphasis

You'll even see that Pandoc implements both behaviours, depending on whether you're in strict mode or not.
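
Roughly, the two camps for the input above (illustrative output only;
the exact markup varies by implementation):

word_with_emphasis

    Markdown.pl-style:     <p>word<em>with</em>emphasis</p>
    Markdown Extra-style:  <p>word_with_emphasis</p>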

Something stranger happens with the shortcut reference syntax:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%5Blink%3F%5D%0A%0A%5Blink%3F%5D%3A+http%3A%2F%2Flink.x%2F
This is bad [sic].
[sIC]: http://sic.sickdomain

I sure wish things would be simpler. But as things are now, I have a hard time identifying what "flavor" could mean. Should "Markdown.pl-1.0.1" be a flavor on its own?
--
Michel Fortin
***@michelf.ca
http://michelf.ca
John MacFarlane
2014-07-10 05:04:44 UTC
Permalink
Post by Michel Fortin
Fun fact: PHP Markdown is mostly encoding agnostic. It understands UTF-8 sequences but any byte that is not a valid UTF-8 sequence is treated as a character in itself. It's only relevant when converting tabs into spaces however, and only if you have non-ASCII characters before the tab.
Small amendment: There are at least two places where the difference
between utf-8 and latin1 matters: tab expansion (as you note) and
reference links, since these are stipulated to be case insensitive.
(Case conversion is sensitive to the encoding.)
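
A tiny illustration of the reference-link point (Python, assuming the
parser decodes the bytes itself; no particular implementation works this
way):

raw = b"[\xc9]"                       # '[É]' if the text is Latin-1
print(raw.decode("latin-1").lower())  # '[é]' -- matches a '[é]: /url' definition
# The same byte is not valid UTF-8 on its own, so a UTF-8-minded parser
# either rejects it or passes it through without any case folding:
#   raw.decode("utf-8")  ->  UnicodeDecodeError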
Sean Leonard
2014-07-10 09:00:32 UTC
Permalink
Post by John MacFarlane
Post by Michel Fortin
Fun fact: PHP Markdown is mostly encoding agnostic. It understands
UTF-8 sequences but any byte that is not a valid UTF-8 sequence is
treated as a character in itself. It's only relevant when converting
tabs into spaces however, and only if you have non-ASCII characters
before the tab.
Small amendment: There are at least two places where the difference
between utf-8 and latin1 matters: tab expansion (as you note) and
reference links, since these are stipulated to be case insensitive.
(Case conversion is sensitive to the encoding.)
I haven't tried it yet, but I suspect PHP Markdown is mostly encoding
agnostic only for most encodings that preserve the US-ASCII range. Try
feeding it an EBCDIC-encoded file. The 0x20-0x3F codes in EBCDIC are not
even printable characters! :)

And speaking of UTF-8, fun fact: there is a UTF-EBCDIC encoding that
represents the whole Unicode repertoire in EBCDIC. See
<http://en.wikipedia.org/wiki/UTF-EBCDIC> and UTR #16.

-Sean
Michel Fortin
2014-07-10 11:48:18 UTC
Permalink
I haven't tried it yet, but I suspect PHP Markdown is mostly encoding agnostic only for most encodings that preserve the US-ASCII range.
That's what I meant by "mostly encoding agnostic". It'll work with ASCII and most European encoding schemes because they are ASCII-compatible, but anything more fancy than that will have to be UTF-8.
--
Michel Fortin
***@michelf.ca
http://michelf.ca
Michel Fortin
2014-07-10 11:53:25 UTC
Permalink
Post by John MacFarlane
Post by Michel Fortin
Fun fact: PHP Markdown is mostly encoding agnostic. It understands UTF-8 sequences but any byte that is not a valid UTF-8 sequence is treated as a character in itself. It's only relevant when converting tabs into spaces however, and only if you have non-ASCII characters before the tab.
Small amendment: There are at least two places where the difference
between utf-8 and latin1 matters: tab expansion (as you note) and
reference links, since these are stipulated to be case insensitive.
(Case conversion is sensitive to the encoding.)
Like Markdown.pl, PHP Markdown will just treat non-ASCII characters in a case-sensitive way so in my case it doesn't matter.

Also, if you want to compare characters in a case-insensitive manner, the most correct way to do it is to use the Unicode Collation Algorithm, not case conversion to lower or uppercase, because some characters can't round-trip (see [german ß]). Then you'll notice that unfortunately Unicode collation is locale dependent (because equivalent characters aren't the same in all locales, see the [turkish ı]). And then you'll realize there's no correct way to do it universally.

[GERMAN SS]: https://en.wikipedia.org/wiki/ß
[TURKISH I]: https://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I

On Babelmark I see that cheapskate 0.1.0.1 understands the first link above -- good job! -- and no one understands the second one.

http://johnmacfarlane.net/babelmark2/?normalize=1&text=Also%2C+if+you+want+to+compare+characters+in+a+case-sensitive+manner%2C+the+most+correct+way+to+do+it+is+to+use+the+Unicode+Collation+Algorithm+--+not+case+conversion+to+lower+or+uppercase+--+because+some+characters+can't+round-trip+(see+%5Bgerman+ß%5D).+Then+you'll+notice+that+unfortunately+Unicode+collation+is+locale+dependent+(because+equivalent+characters+aren't+the+same+in+all+locales%2C+see+the+%5Bturkish+ı%5D).+And+then+you'll+realize+there's+not+really+a+correct+way+to+do+it.%0A%0A+%5BGERMAN+SS%5D%3A+https%3A%2F%2Fen.wikipedia.org%2Fwiki%2Fß%0A+%5BTURKISH+I%5D%3A+https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTurkish_dotted_and_dotless_I%0A
--
Michel Fortin
***@michelf.ca
http://michelf.ca
John MacFarlane
2014-07-10 22:18:51 UTC
Permalink
Post by Michel Fortin
Post by John MacFarlane
Post by Michel Fortin
Fun fact: PHP Markdown is mostly encoding agnostic. It understands UTF-8 sequences but any byte that is not a valid UTF-8 sequence is treated as a character in itself. It's only relevant when converting tabs into spaces however, and only if you have non-ASCII characters before the tab.
Small amendment: There are at least two places where the difference
between utf-8 and latin1 matters: tab expansion (as you note) and
reference links, since these are stipulated to be case insensitive.
(Case conversion is sensitive to the encoding.)
Like Markdown.pl, PHP Markdown will just treat non-ASCII characters in a case-sensitive way so in my case it doesn't matter.
I think this is a deficiency in Markdown.pl. The syntax description
says that reference links are case-insensitive, and it doesn't say
anything about this applying only to ASCII references. I think someone
who writes in, say, Spanish, would quite naturally expect words with
accents to behave the same as words without accents in reference links.

By the way, I'm not sure what the motivation for making the reference
links case-insensitive was. I conjecture that it was to allow the
following sort of thing:

[Foo][] is better than [bar][]. And [Bar][] is worse than [foo][].

[foo]: /url1
[bar]: /url2

This is a good motivation: it would be a burden to have to define
separate references for capitalized and uncapitalized versions of a
phrase, or to use the longer form `[Foo][foo]` for capitalized
versions. But this motivation extends naturally beyond ascii.

Hence, I think markdown processors *should* do a proper unicode
case fold in determining when references match.

Unfortunately, as you point out, this becomes very complex, and
brings in locale dependence for a few cases (e.g. Turkish). Still,
I think it's the ideal we should aspire to.
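
For what it's worth, a quick sketch of the difference (Python, purely as
an illustration of the ideal, not of any existing implementation):

print("Straße".lower())     # 'straße'  -- never matches 'STRASSE'.lower() == 'strasse'
print("Straße".casefold())  # 'strasse' -- full case folding lets [Straße] and [STRASSE] match
# ...and even then the Turkish dotted/dotless i needs locale knowledge
# that case folding alone does not provide.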
Carl Jacobsen
2014-07-10 10:46:23 UTC
Permalink
Post by Michel Fortin
I sure wish things would be simpler. But as things are now, I have a hard time identifying what "flavor" could mean. Should "Markdown.pl-1.0.1" be a flavor on its own?
Perhaps it would be better to have *two* optional fields:

- A "processor" field, as in `processor="Markdown.pl-1.0.1"` or `processor="Pandoc-1.12.4.2-nostrict", which indicates that the sender was using (that program) to view the output and was satisfied with it. This could be included whenever known, and hopefully *not* relied upon by the recipient, but could provide useful clues in some cases where the recipient *has* to decide how to interpret something.

- A "flavor" field consisting of zero or more alphanumeric tokens, separated by "+" or "," or some such, declaring well-understood deviations from, or extensions to, the original standard. Not as a completely exhaustive list (you don't have to be able to indicate that your processor has, say, special syntax for animated gif backgrounds if no one else can use that anyway), but simply to promote better interoperability, to be interpreted as "the text contained herein supports basic markdown, plus *at least* X and Y and Z capabilities". Include some useful things like:

- "nofill" (hard returns should be obeyed rather than joining lines - this would clarify that "two github flavors" situation),
- "tables" (supports some agreed-upon "least common denominator" table syntax),
- "footnotes" (similar but for footnotes),
- "fencedcodeblocks" (you get the idea),
- "titleblock" (text lines before first blank line are some sort of metadata and shouldn't be displayed to casual viewers),
- "restrictunderscores" (mid-word underscores are not to be interpreted as starting/ending emphasized or strong text)

The idea being, merely having the "text/markdown" content type would not guarantee anything beyond "well, it's *some* sort of markdown flavored text", while specifying one or more of the attributes in the "flavor" field would indicate "yes, you can assume that if something looks like an X, it *is* an X" (where X is something from the list above). It doesn't necessarily mean you *will* find, say, a table, in the text, merely that if you see something that looks like a table, it likely is.

This way, senders can give a series of hints to make their text more clear to parse, using the "flavor" attribute, and can also declare what processor they were using to view/validate things on their end, in case the recipient feels they simply must know precisely how to interpret the text (at which point it's up to them to decide what, if any, conversion or special handling is necessary). And a recipient wouldn't need to consult an exhaustive list of processors, but simply look at the flavors attribute to see if it recognizes anything there as a useful hint (and render everything else naively).

So, you might have: content-type="text/markdown" flavor="tables+titleblock" processor="floobity-1.2.3"

We'd need an agreed-upon list of "ingredients" to put in the "flavor" field, but, again, you wouldn't want to exhaustively list every variant of every extension that any markdown-ish processor has ever come up with -- only perhaps a dozen cases to cover the most common extensions (if you need to be more exacting, go consult the processor field and take matters into your own hands).
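
As a sketch of what a recipient might do with such hints (Python; the
"flavor" and "processor" names are just the hypothetical ones proposed
above, not anything registered):

def parse_params(content_type):
    # Naive split of 'text/markdown; charset=UTF-8; flavor="..."; processor="..."'.
    params = {}
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        params[key.lower()] = value.strip('"')
    return params

params = parse_params(
    'text/markdown; charset=UTF-8; flavor="tables+titleblock"; processor="floobity-1.2.3"')
hints = set(params.get("flavor", "").split("+")) - {""}
if "tables" in hints:
    pass  # safe to treat table-looking text as a table; otherwise render it naively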

Does this make sense to anyone else? (I probably should have gone to sleep a long time ago.)

Cheers,
Carl Jacobsen
Sreeharsha Mudivarti
2014-07-10 11:22:39 UTC
Permalink
Post by Carl Jacobsen
So, you might have: content-type="text/markdown"
flavor="tables+titleblock" processor="floobity-1.2.3"
That looks fine, but it has too many combinations.

I don't think Markdown should be standardised.

It is,

1) Incomplete as compared to asciidoc

2) Self-Contradicting

No general escaping

3) No formal grammar

It is typically implemented with regexes.
So, unlike HTML, a bold phrase can't span two lines.

In the meantime you could standardise AsciiDoc, MultiMarkdown, or WikiCreole,
which are saner than the original Markdown and already have specs.

Markdown models its formatting on email conventions.
That's not true, because email conventions are free-form ASCII.
Consider this email thread, for example.

Cheers,
Harsha
Aristotle Pagaltzis
2014-07-10 12:18:54 UTC
Permalink
Post by Carl Jacobsen
So, you might have: content-type="text/markdown" flavor="tables+titleblock" processor="floobity-1.2.3"
That’s very nice, except it will never happen.

• No user is going to annotate their files in such a way that this MIME
type would ever show up in the wild in such specificity, unless they
use special software which automatically records the relevant metadata
(rather than just a text editor + file system workflow).

• No implementor is going to write a Markdown processor that is actually
capable of dealing with this MIME type. (Except, of course, the author
of floobity 1.2.3 itself in this example; and that hardly counts as
making it work in a generic way.)

• If the flavours refer to existing syntax extensions and modifications,
those are all only vaguely specified, and the implementations that
offer them are not likely to be changed to follow a more rigorous spec
of the respective syntax extension any time soon. So in practice these
flavour specs are no more well-defined than saying the document is
(some kind of) Markdown.

• Documents are generally written for a specific processor, not for some
particular combination of syntax extensions in the abstract. There are
a number of processors which implement half a dozen separate syntax
extensions. Documents written for these processors notionally employ
all of these syntax extensions. Will documents in e.g. GitHub-Flavored
Markdown always have to list the entire enchilada in their MIME type
(instead of just saying “GFM”)? Or do you force whatever part of the
infrastructure picks the MIME type to parse the document (and contain
knowledge of all possible syntax extensions) to figure out which of
them are actually used?

• I won’t make a separate bullet point for the endless effort of keeping
track of all the possible syntax extensions, which is going to be
a cat herder job, because that should be an obvious problem. But what
if someone wants to implement a new syntax extension – what process do
they have to go through before they can assign a truthful MIME type to
their documents?

Metadata is hard. Let’s go shopping.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
John MacFarlane
2014-07-10 22:28:57 UTC
Permalink
Post by Aristotle Pagaltzis
• No implementor is going to write a Markdown processor that is actually
capable of dealing with this MIME type. (Except, of course, the author
of floobity 1.2.3 itself in this example; and that hardly counts as
making it work in a generic way.)
Pandoc can already do this. Quick demo:

pandoc -s -f markdown_strict+pandoc_title_block+tex_math_dollars -t markdown_strict+mmd_title_block+tex_math_double_backslash


% Demo
% John
% July 1, 2014

Here is some math: $e=mc^2$.

^D
author: John
date: July 1, 2014
title: Demo

Here is some math: \\(e=mc^2\\).


Here's my current list of extensions/variations (from the pandoc
source code). Of course, it's nowhere near exhaustive:

Ext_footnotes -- ^ Pandoc/PHP/MMD style footnotes
| Ext_inline_notes -- ^ Pandoc-style inline notes
| Ext_pandoc_title_block -- ^ Pandoc title block
| Ext_yaml_metadata_block -- ^ YAML metadata block
| Ext_mmd_title_block -- ^ Multimarkdown metadata block
| Ext_table_captions -- ^ Pandoc-style table captions
| Ext_implicit_figures -- ^ A paragraph with just an image is a figure
| Ext_simple_tables -- ^ Pandoc-style simple tables
| Ext_multiline_tables -- ^ Pandoc-style multiline tables
| Ext_grid_tables -- ^ Grid tables (pandoc, reST)
| Ext_pipe_tables -- ^ Pipe tables (as in PHP markdown extra)
| Ext_citations -- ^ Pandoc/citeproc citations
| Ext_raw_tex -- ^ Allow raw TeX (other than math)
| Ext_raw_html -- ^ Allow raw HTML
| Ext_tex_math_dollars -- ^ TeX math between $..$ or $$..$$
| Ext_tex_math_single_backslash -- ^ TeX math btw \(..\) \[..\]
| Ext_tex_math_double_backslash -- ^ TeX math btw \\(..\\) \\[..\\]
| Ext_latex_macros -- ^ Parse LaTeX macro definitions (for math only)
| Ext_fenced_code_blocks -- ^ Parse fenced code blocks
| Ext_fenced_code_attributes -- ^ Allow attributes on fenced code blocks
| Ext_backtick_code_blocks -- ^ Github style ``` code blocks
| Ext_inline_code_attributes -- ^ Allow attributes on inline code
| Ext_markdown_in_html_blocks -- ^ Interpret as markdown inside HTML blocks
| Ext_markdown_attribute -- ^ Interpret text inside HTML as markdown
-- iff container has attribute 'markdown'
| Ext_escaped_line_breaks -- ^ Treat a backslash at EOL as linebreak
| Ext_link_attributes -- ^ MMD style reference link attributes
| Ext_autolink_bare_uris -- ^ Make all absolute URIs into links
| Ext_fancy_lists -- ^ Enable fancy list numbers and delimiters
| Ext_lists_without_preceding_blankline -- ^ Allow lists without preceding blank
| Ext_startnum -- ^ Make start number of ordered list significant
| Ext_definition_lists -- ^ Definition lists as in pandoc, mmd, php
| Ext_example_lists -- ^ Markdown-style numbered examples
| Ext_all_symbols_escapable -- ^ Make all non-alphanumerics escapable
| Ext_intraword_underscores -- ^ Treat underscore inside word as literal
| Ext_blank_before_blockquote -- ^ Require blank line before a blockquote
| Ext_blank_before_header -- ^ Require blank line before a header
| Ext_strikeout -- ^ Strikeout using ~~this~~ syntax
| Ext_superscript -- ^ Superscript using ^this^ syntax
| Ext_subscript -- ^ Subscript using ~this~ syntax
| Ext_hard_line_breaks -- ^ All newlines become hard line breaks
| Ext_ignore_line_breaks -- ^ Newlines in paragraphs are ignored
| Ext_literate_haskell -- ^ Enable literate Haskell conventions
| Ext_abbreviations -- ^ PHP markdown extra abbreviation definitions
| Ext_auto_identifiers -- ^ Automatic identifiers for headers
| Ext_ascii_identifiers -- ^ ascii-only identifiers for headers
| Ext_header_attributes -- ^ Explicit header attributes {#id .class k=v}
| Ext_mmd_header_identifiers -- ^ Multimarkdown style header identifiers [myid]
| Ext_implicit_header_references -- ^ Implicit reference links for headers
| Ext_line_blocks -- ^ RST style line blocks
Sreeharsha Mudivarti
2014-07-10 23:34:01 UTC
Permalink
Post by John MacFarlane
Here is some math: \\(e=mc^2\\).
Here's my current list of extensions/variations (from the pandoc
Ext_footnotes -- ^ Pandoc/PHP/MMD style footnotes
| Ext_inline_notes -- ^ Pandoc-style inline notes
| Ext_pandoc_title_block -- ^ Pandoc title block
| Ext_yaml_metadata_block -- ^ YAML metadata block
As admirable as pandoc is, testing that is difficult.

Given that features can be nested, the number of combinations is easily
greater than the number of lines in

https://github.com/jgm/pandoc/blob/master/tests/Tests/Readers/Markdown.hs

Markdown is an overloaded term. It is safe to say that
there are at least two useful Markdowns:

1) Plain

Useful for,

* Non-technical publishing
* Simple comment systems

( Markdown.pl )

2) Complete

* DocBook / Latex style publishing
* Editors

( MultiMarkdown )

A third type of Markdown can be labelled "Proprietary Markdown",
of which Github Markdown is a prime example.

Markdown implementations can provide pre-processors and post-processors,
which can implement "Proprietary Extensions".

I don't understand why different things have to be conflated together.

If the IETF draft is written from the perspective of publishing, then 1) can be ignored.
John MacFarlane
2014-07-10 22:23:53 UTC
Permalink
Post by Carl Jacobsen
Post by Michel Fortin
I sure wish things would be simpler. But as things are now, I have a hard time identifying what "flavor" could mean. Should "Markdown.pl-1.0.1" be a flavor on its own?
- A "processor" field, as in `processor="Markdown.pl-1.0.1"` or `processor="Pandoc-1.12.4.2-nostrict", which indicates that the sender was using (that program) to view the output and was satisfied with it. This could be included whenever known, and hopefully *not* relied upon by the recipient, but could provide useful clues in some cases where the recipient *has* to decide how to interpret something.
- "nofill" (hard returns should be obeyed rather than joining lines - this would clarify that "two github flavors" situation),
- "tables" (supports some agreed-upon "least common denominator" table syntax),
- "footnotes" (similar but for footnotes),
- "fencedcodeblocks" (you get the idea),
- "titleblock" (text lines before first blank line are some sort of metadata and shouldn't be displayed to casual viewers),
- "restrictunderscores" (mid-word underscores are not to be interpreted as starting/ending emphasized or strong text)
I already made an attempt in pandoc to factor out some of these
dimensions of variability (so that different markdown flavors can be
converted to each other). You can specify, e.g.,
markdown+pipe_tables+footnotes-tex_math_dollars.

See http://johnmacfarlane.net/pandoc/README.html#pandocs-markdown for
a list.
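
For example, a conversion using that kind of spec looks something like
this (the filename and output format are just placeholders):

pandoc -f markdown+pipe_tables+footnotes-tex_math_dollars -t html5 document.md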

However, there are hundreds more dimensions on which markdown
implementations vary (I could go on and on). A crude list of
extensions/variations might be helpful, but I don't think you
could get close to a complete list.
Sean Leonard
2014-07-11 08:54:43 UTC
Permalink
So this thread has a lot of content, and is leading me to revise the
proposal a few different ways. Thanks everyone thus far; it has been
very educational.

I would like to ask the community here a basic question, so I can
start to reason out from there.

It seems that there is a general consensus that Markdown is an
open-ended informal family of syntaxes based on John Gruber's original
work. Everything in some way traces back to the original Markdown.pl
script and syntax specification circa 2004. If it does not or cannot
trace back, it's not Markdown--it's something else (e.g., reStructuredText).

At the same time, the proliferation of variations, extensions, fixes,
tweaks, and everything else has led to a staunch lack of consensus on
what constitutes "Standard Markdown", i.e., the Markdown in 2014 that
/everyone ought to follow/. This is in contrast to, say, XML or
HTML--with XML there are very strict standards of what one ought to
follow; with HTML it's much more open-ended but at least there is one
organization (W3C) and one "living standard" (HTML5) where people can
glom on their kitchen-sink proposals.

Since we cannot reach consensus on what ought to be "Standard Markdown"
today, can the community reach consensus on "Historical Markdown"--of
which I propose three working definitions?

* Classic Markdown: The Markdown syntax or Markdown.pl implementation,
as implemented by John Gruber, in 1.0.1, with all ambiguities, bugs,
frustrations, and contradictions. [In cases that the syntax and the tool
contradict, we come up with a way to resolve the contradictions.]

* Original Markdown: The Markdown syntax or Markdown.pl implementation,
as implemented by John Gruber, in 1.0.2b7, with as many of the
ambiguities, bugs, frustrations, and contradictions fixed as he actually
fixed (or failed to fix) them. Aka "Markdown Web Dingus".

* Idealized Markdown (aka Historical Standard Markdown): The Markdown
that everyone can agree is the way Markdown "should have been" back when
there was One True Markdown. Basically this is Original Markdown with
its faults duly recognized and corrected...many of these faults having
been corrected in practice in divergent implementations (Markdown Extra
etc.) but never officially recognized in Original Markdown.


I cannot say which of these three is better...but by recognizing these
three as common points, we can then start to compare on the same page.

-Sean
Michel Fortin
2014-07-11 10:04:45 UTC
Permalink
Since we cannot reach consensus on what ought to be "Standard Markdown" today, can the community reach consensus on "Historical Markdown"--of which I propose three working definitions?
* Classic Markdown: The Markdown syntax or Markdown.pl implementation, as implemented by John Gruber, in 1.0.1, with all ambiguities, bugs, frustrations, and contradictions. [In cases that the syntax and the tool contradict, we come up with a way to resolve the contradictions.]
* Original Markdown: The Markdown syntax or Markdown.pl implementation, as implemented by John Gruber, in 1.0.2b7, with as many of the ambiguities, bugs, frustrations, and contradictions fixed as he actually fixed (or failed to fix) them. Aka "Markdown Web Dingus".
* Idealized Markdown (aka Historical Standard Markdown): The Markdown that everyone can agree is the way Markdown "should have been" back when there was One True Markdown. Basically this is Original Markdown with its faults duly recognized and corrected...many of these faults having been corrected in practice in divergent implementations (Markdown Extra etc.) but never officially recognized in Original Markdown.
I cannot say which of these three is better...but by recognizing these three as common points, we can then start to compare on the same page.
You might also call the first two "Markdown 1.0.1" and "Markdown 1.0.2b7" for simplicity's sake. As for the idealized version, that's what I call "Markdown" personally, or "plain Markdown" when I need to disambiguate.

Wasn't 1.0.2b8 the last one though? Why is the Dingus running 1.0.2b7? Babelmark 2 has 1.0.2b8.
--
Michel Fortin
***@michelf.ca
http://michelf.ca
Sean Leonard
2014-07-11 10:06:22 UTC
Permalink
Post by Michel Fortin
[ ... ]
You might also call the first two "Markdown 1.0.1" and "Markdown 1.0.2b7" for simplicity's sake. As for the idealized version, that's what I call "Markdown" personally, or "plain Markdown" when I need to disambiguate.
Wasn't 1.0.2b8 the last one though? Why is the Dingus running 1.0.2b7? Babelmark 2 has 1.0.2b8.
http://daringfireball.net/projects/markdown/dingus

says "1.0.2b7". Not sure what's up with that.

-Sean
Sean Leonard
2014-07-11 10:08:14 UTC
Permalink
Post by Michel Fortin
Since we cannot reach consensus on what ought to be "Standard Markdown" today, can the community reach consensus on "Historical Markdown"--of which I propose three working definitions?
* Classic Markdown: The Markdown syntax or Markdown.pl implementation, as implemented by John Gruber, in 1.0.1, with all ambiguities, bugs, frustrations, and contradictions. [In cases that the syntax and the tool contradict, we come up with a way to resolve the contradictions.]
* Original Markdown: The Markdown syntax or Markdown.pl implementation, as implemented by John Gruber, in 1.0.2b7, with as many of the ambiguities, bugs, frustrations, and contradictions fixed as he actually fixed (or failed to fix) them. Aka "Markdown Web Dingus".
* Idealized Markdown (aka Historical Standard Markdown): The Markdown that everyone can agree is the way Markdown "should have been" back when there was One True Markdown. Basically this is Original Markdown with its faults duly recognized and corrected...many of these faults having been corrected in practice in divergent implementations (Markdown Extra etc.) but never officially recognized in Original Markdown.
I cannot say which of these three is better...but by recognizing these three as common points, we can then start to compare on the same page.
You might also call the first two "Markdown 1.0.1" and "Markdown 1.0.2b7" for simplicity's sake. As for the idealized version, that's what I call "Markdown" personally, or "plain Markdown" when I need to disambiguate.
Ok; however, I understand that there are some differences between the
syntax <http://daringfireball.net/projects/markdown/syntax> and the
1.0.1 implementation. Maybe also the 1.0.2b[x] implementation(s). Right?

-Sean
Michel Fortin
2014-07-11 11:30:40 UTC
Permalink
Post by Michel Fortin
You might also call the first two "Markdown 1.0.1" and "Markdown 1.0.2b7" for simplicity's sake. As for the idealized version, that's what I call "Markdown" personally, or "plain Markdown" when I need to disambiguate.
Ok; however, I understand that there are some differences between the syntax <http://daringfireball.net/projects/markdown/syntax> and the 1.0.1 implementation. Maybe also the 1.0.2b[x] implementation(s). Right?
In the 1.0.2 beta branch the HTML block parser supports the markdown="1" attribute, but also introduces some regressions; shortcut reference links were added; and there has been some hacky bug fixing regarding code-span-like things in the attributes of HTML tags (though I'd argue it just shifts the errors somewhere else). The version history is right there if you want the differences between 1.0.1 and 1.0.2b[x] (it looks like someone posted 1.0.2b8 on GitHub for convenience):
<https://github.com/mayoff/Mathdown/blob/master/Markdown.pl#L1529>

The syntax page documents the 1.0.1 features. Parsing of list indentation, however, doesn't work exactly as described in that document. The first point of this answer in the Babelmark 2 FAQ gives more details:
<http://johnmacfarlane.net/babelmark2/faq.html#what-are-some-big-questions-that-the-markdown-spec-does-not-answer>

Besides that, the document makes many simplifications to keep it easy to understand from a user's perspective. It is not really an implementer's guide.
--
Michel Fortin
***@michelf.ca
http://michelf.ca
John MacFarlane
2014-07-11 23:20:16 UTC
Permalink
Post by Sean Leonard
Since we cannot reach consensus on what ought to be "Standard
Markdown" today, can the community reach consensus on "Historical
Markdown"--of which I propose three working definitions?
I think the only sensible thing to refer to is John Gruber's Markdown
syntax description, which is the canonical reference (even if it is very
incomplete on the details). Markdown.pl 1.0.1 and 1.0.2b7 are both
buggy implementations. Neither one is faithful to the syntax
description.

To give just one example: the syntax description says,
"Each subsequent paragraph in a list item must be indented by either 4
spaces or one tab." But neither version of Markdown.pl actually
imposes this requirement:

http://johnmacfarlane.net/babelmark2/?normalize=1&text=+-+item%0A%0A+more%0A%0A+-+new+item%0A
Waylan Limberg
2014-07-12 02:32:39 UTC
Permalink
Post by John MacFarlane
Since we cannot reach consensus on what ought to be "Standard Markdown" today, can the community reach consensus on "Historical Markdown"--of which I propose three working definitions?
I think the only sensible thing to refer to is John Gruber's Markdown
syntax description, which is the canonical reference (even if it is very
incomplete on the details). Markdown.pl 1.0.1 and 1.0.2b7 are both
buggy implementations. Neither one is faithful to the syntax
description.
I agree. There is one markdown -- the syntax rules. While there may be many implementations, they are all buggy -- whether intentionally or not.

Actually, I might be persuaded that there are two: the rules, and "extended markdown" -- which would be all intentional deviations from the rules. If your documents are "markdown" then they strictly follow the rules and most likely will be parsed by all markdown parsers the same way. However, if your document is "extended markdown", then all bets are off. Such a label is in effect saying: "Hey, this text document represents markdown text, but may not strictly be pure markdown text. Weird things may happen. Consider yourself warned." Beyond that, I see no need to specify anything further.

Waylan Limberg
Sean Leonard
2014-07-12 14:32:06 UTC
Permalink
As I'm thinking about this, I have other questions:

Can a Markdown parser/processor fail? Is there a concept of Markdown
validity--i.e., can Markdown content be invalid (from the perspective of
Markdown, not (X)HTML)?

As I understand it:
A Markdown processor identifies Markdown control sequences (aka
markdown, in lowercase) in a stream of text and converts these sequences
to the target markup--namely (X)HTML.
A Markdown processor identifies (X)HTML in markdown and passes this
content to the target markup.
<-- Do Markdown processors (i.e., existing implementations) attempt to
fix or normalize the markup (by deserializing and then reserializing the
markup), or is it a straight pass? It sounds like whether or not a
Markdown processor reserializes the markup is implementation-dependent;
Gruber's syntax rules do not say. However, if you have Markdown in the
HTML content with markdown="1" as with PHP Markdown Extra, it is
necessary to parse the HTML with something other than a straight HTML
parser since the straight HTML parser will misinterpret the Markdown
(e.g., & will be a validation error).


Therefore:
Markdown has no concept of markdown validity. A Markdown processor never
fails due to invalid markdown input. If a sequence of text is not
recognized as markdown (i.e., control sequences), it is treated as text
and passed accordingly to the target markup. (This property is directly
related to the "degradation" feature of Markdown, namely, if your
processor cannot understand the markdown, the output is "worse" than an
author intended, but does not cause utter failure--the non-understood
markdown is visible in the output. This is in contrast to HTML, where
tags or attributes that are not understood have no effect on the
presentation of the HTML.)
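
A made-up illustration of that degradation property (not taken from any
particular implementation): feed a baseline processor a fenced code block,
which is not a baseline control sequence:

    ~~~
    code here
    ~~~

An approximate result is:

    <p>~~~
    code here
    ~~~</p>

The tildes come out as visible paragraph text--"worse" than the author
intended, but nothing fails.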

Markdown may have a concept of HTML validity. A Markdown processor that
identifies HTML in Markdown content may determine that the HTML is valid
or invalid. For example, it may identify <div> ... [end of document] as
HTML that is invalid because it lacks a closing </div> tag. Then, it has
five choices:
1. treat the invalid HTML as text--pass the text-as-text to the markup
(i.e., turn & into &amp; , < into &lt; , etc.)
2. treat the invalid HTML as Markdown--keep on processing the input and
look for markdown inside of it (thus *hello* inside the invalid HTML
will get marked up...and <div><a
href="http://www.example.com/">hello</a>[end of document] will become a
real link with the literal text '<div>' preceding it)
<-- this is the same behavior as "not identifying the text as HTML in
the first place"
3. pass the invalid HTML as HTML
4. attempt to fix the HTML...thus <div><a
href="http://www.example.com/">hello</a>[end of document] might become
<div><a href="http://www.example.com/">hello</a></div>
5. fail due to HTML invalidity

?

Sean
Aristotle Pagaltzis
2014-07-12 16:08:14 UTC
Permalink
However, if you have Markdown in the HTML content with markdown="1" as
with PHP Markdown Extra, it is necessary to parse the HTML with
something other than a straight HTML parser since the straight HTML
parser will misinterpret the Markdown (e.g., & will be a validation
error).
That parser is Markdown itself. You can already put Markdown inside
HTML tags, it’s just that normally Markdown will only parse the content
of inline tags like EM and SPAN, not block tags like P or DIV. This was
an explicit design choice. The markdown="1" attribute does nothing more
than turn off this distinction temporarily.
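
A purely illustrative example of that attribute (PHP Markdown Extra-style
processors; other implementations may differ):

    <div markdown="1">
    This *emphasis* is processed even though it sits inside a DIV,
    because markdown="1" temporarily lifts the block-tag rule.
    </div>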

(The block tag rule allows you to write portions of your document as
plain old HTML when Markdown is insufficient, and also allows you to
pass stuff through Markdown several times (e.g. fragments in a CMS
getting passed through Markdown at various stages of page assembly)
without screwing up the document. I consider it the smartest choice in
the design of Markdown: the reason it has been adopted where other
syntaxes have remained confined to niches. It means almost any HTML
fragment is also a Markdown fragment, so it’s easy to add Markdown to
any publishing workflow that involves HTML somewhere even if it wasn’t
designed for that at all, and the content can then be ported piecemeal
instead of boil-the-ocean. Classic embrace-and-extend.)
Markdown has no concept of markdown validity.
Correct.
Markdown may have a concept of HTML validity.
Not really. Individual processors may, but Markdown itself has nothing
to say about that. The original implementation of course is implemented
as a text substitution system, which means if you give it Markdown that
contains invalid HTML then you’ll simply get HTML that’s invalid in the
same way, to then be interpreted by the browser however the browser may.
My guess is that the majority of implementations behave equivalently to
this, though depending on their design they could differ completely.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Michel Fortin
2014-07-12 18:52:21 UTC
Permalink
1. treat the invalid HTML as text--pass the text-as-text to the markup (i.e., turn & into &amp; , < into &lt; , etc.)
2. treat the invalid HTML as Markdown--keep on processing the input and look for markdown inside of it (thus *hello* inside the invalid HTML will get marked up...and <div><a href="http://www.example.com/">hello</a>[end of document] will become a real link with the literal text '<div>' preceding it)
<-- this is the same behavior as "not identifying the text as HTML in the first place"
3. pass the invalid HTML as HTML
4. attempt to fix the HTML...thus <div><a href="http://www.example.com/">hello</a>[end of document] might become <div><a href="http://www.example.com/">hello</a></div>
5. fail due to HTML invalidity
?
Is that really a question?

1. Turning `&` and `<` into `&amp;` and `&lt;` is part of the official syntax rules. Hopefully every Markdown parser does that.
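
(For example, per the syntax page's rules on automatic escaping, something
like `AT&T says 4 < 5 &copy; 2014` should come out as
`AT&amp;T says 4 &lt; 5 &copy; 2014`: bare ampersands and angle brackets
get encoded, while an existing entity such as `&copy;` is left alone.)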

2. 3. 4. 5. We have implementations doing all of that, probably mixing a few of those solutions depending on the exact error.

When you have a question like this, just try it in Babelmark 2:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E
--
Michel Fortin
***@michelf.ca
http://michelf.ca
Waylan Limberg
2014-07-12 19:31:05 UTC
Permalink
[snip]
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E
Yes, that's what we all do. And to answer your other question, notice that only two of the implementations on Babelmark 2 failed. Remember, most of these implementations were written to be run on web servers. We can't have our web servers crashing just because a user submitted invalid markdown. What a parser doesn't understand, it just passes through. What it misunderstands, it garbles, but it is specifically designed to never choke.

As Michel alluded to, most parsers are simply a series of regular expression substitutions which are run in a predetermined order. If a regex never matches a part of the text, then that part passes through untouched. Yes, that means the HTML is parsed by regex -- which we all know is a bad idea -- but it is not really parsed in the way that browsers parse HTML. The regex just finds anything surrounded by angle brackets and ignores it. With the exception of the limited block-level stuff, we don't even care if there are opening and/or closing tags. Yes, that can result in improperly nested stuff, but that is the author's fault and the parser should not bring the whole server down for that. The author can (should?) preview in a browser and fix it before publishing.

However, I should point out that while the above describes most parsers (as most are more or less direct ports of markdown.pl - which works this way), there are a few that use other methods under the hood. For example, a few generate a parse tree which is then fed into a renderer (I believe Pandoc works like that, which allows it to output many more formats than just HTML), but they are the rare exception.
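
To make the shape of that concrete, here is a tiny, purely illustrative
Python sketch (not any real implementation): a fixed, ordered list of regex
substitutions, with anything in angle brackets treated as opaque HTML and
passed through untouched.

    import re

    TAG = re.compile(r'<[^>]+>')   # "anything surrounded by angle brackets" counts as HTML

    def inline(md):
        # ordered substitutions; text that matches nothing passes through untouched
        md = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', md)
        md = re.sub(r'\*(.+?)\*', r'<em>\1</em>', md)
        return md

    def toy_markdown(text):
        pieces, tags = TAG.split(text), TAG.findall(text)
        out = []
        for i, piece in enumerate(pieces):
            out.append(inline(piece))      # markdown rules apply outside the tags
            if i < len(tags):
                out.append(tags[i])        # raw (possibly invalid) HTML goes through as-is
        return ''.join(out)

    print(toy_markdown('Hello <a href="http://example.com/">*nice*</a> and *plain*'))
    # -> Hello <a href="http://example.com/"><em>nice</em></a> and <em>plain</em>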

Waylan
Sean Leonard
2014-07-12 22:23:51 UTC
Permalink
Post by Waylan Limberg
[snip]
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E
Yes, that's what we all do. And to answer your other question, notice that only two of the implementations on Babelmark2 failed. Remember, most of these implementations were written to be run on web servers. We can't have our web servers crashing just because a user submitted invalid markdown. What a parser doesn't understand is just passes through. What it misunderstands is garbles but it is specifically designed to never choke.
As Michel alluded to, most parsers are simply a series of regular expression substitutions which are run in a predetermined order. If a regex never matches a part of the text, then that part passes through untouched. Yes, that means the HTML is parsed by regex - which we all know is a bad idea -- but it is not really parsed in the way that browsers parse HTML. The regex just finds anything surrounded by angle brackets and ignores it. With the exception of the limited block level stuff, we don't even care if there are opening and/or closing tags. Yes, that can result in improperly nested stuff, but that is the authors fault and the parser should not bring the whole server down for that. The Author can (should?) preview in a browser and fix it before publishing.
However, I should point out that while the above describes most parsers (as most are more or less direct ports of markdown.pl - which works this way), there are a few that use other methods under the hood. For example, a few generate a parse tree which is then fed into a renderer (I believe Pandoc works like that, which allows it to output many more formats than just HTML), but they are the rare exception.
I see.

Here is a real-world example of what I was citing:
http://johnmacfarlane.net/babelmark2/?text=Hello+I+am+some+*text*.%0A%3Cdiv%3EHello+%3Ca+href%3D%22http%3A%2F%2Fwww.example.com%2F%22%3Ethat+is+nice%3C%2Fa%3E+chance+%26+circumstance%26hellip%3B%0A%0AThe+end.

Truly, it looks like there is great diversity in Markdown-land.

Ok, so any standard mentioning Historical Markdown cannot say that any
particular behavior is normative when it comes to HTML validity. Some
check for HTML (island) validity and behave differently; others don't.
The end...I guess.

Sean
Waylan Limberg
2014-07-12 23:48:49 UTC
Permalink
Post by Sean Leonard
Post by Waylan Limberg
[snip]
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E
Yes, that's what we all do. And to answer your other question, notice that only two of the implementations on Babelmark2 failed. Remember, most of these implementations were written to be run on web servers. We can't have our web servers crashing just because a user submitted invalid markdown. What a parser doesn't understand is just passes through. What it misunderstands is garbles but it is specifically designed to never choke.
As Michel alluded to, most parsers are simply a series of regular expression substitutions which are run in a predetermined order. If a regex never matches a part of the text, then that part passes through untouched. Yes, that means the HTML is parsed by regex - which we all know is a bad idea -- but it is not really parsed in the way that browsers parse HTML. The regex just finds anything surrounded by angle brackets and ignores it. With the exception of the limited block level stuff, we don't even care if there are opening and/or closing tags. Yes, that can result in improperly nested stuff, but that is the authors fault and the parser should not bring the whole server down for that. The Author can (should?) preview in a browser and fix it before publishing.
However, I should point out that while the above describes most parsers (as most are more or less direct ports of markdown.pl - which works this way), there are a few that use other methods under the hood. For example, a few generate a parse tree which is then fed into a renderer (I believe Pandoc works like that, which allows it to output many more formats than just HTML), but they are the rare exception.
I see.
http://johnmacfarlane.net/babelmark2/?text=Hello+I+am+some+*text*.%0A%3Cdiv%3EHello+%3Ca+href%3D%22http%3A%2F%2Fwww.example.com%2F%22%3Ethat+is+nice%3C%2Fa%3E+chance+%26+circumstance%26hellip%3B%0A%0AThe+end.
Truly, it looks like there is great diversity in Markdown-land.
Ok, so any standard mentioning Historical Markdown cannot say that any particular behavior is normative when it comes to HTML validity. Some check for HTML (island) validity and behave differently; others don't. The end...I guess.
Yes, but select "normalize" (which normalizes insignificant white space in the output), and the number of variations decreases. Unfortunately, there is absolutely no standardization in how the various implementations handle white space (I don't think I've seen two that match exactly in every corner case). Either way though, hit the "preview" button (top right of output) to see how the browser renders the output and all but a couple render in the browser exactly the same.

And that is what makes markdown so great. You don't need to know or understand HTML to write it if you are using markdown. And if you have only an elementary knowledge of HTML, you can break into HTML on those few occasions when markdown won't do what you need.

Waylan
Sean Leonard
2014-07-15 07:20:27 UTC
Permalink
Thank you all for the informative feedback and comments.

Let me get to the punchline. Now having a much better understanding of
the extraordinary diversity of Markdown expressions that are out there,
I think that the "flavor" parameter does not make sense. Instead let me
introduce proposal rev 2, which includes two optional parameters:
variants and processor. (This is very similar to Carl Jacobsen's
proposal--the main differences being that I am adding more formality.)

[This is not specification text, but something like it might appear in
draft -01. For the sake of this post, I am avoiding explicit discussion
of syntax.]

Parameters are defined in RFC 6838 as "companion data", that is, data
that assists with the meaning or interpretation. Parameters can be
"advisory" (derived from the content--thus allowing a consumer to avoid
parsing the content), "tangential" (informational but not affecting the
interpretation of the content), or "material" (has a material effect on
how the content is interpreted). In the case of Markdown, the processor
and variants parameters are material in that they reflect the author's
intent on how best to interpret the content. If absent, the author
expresses no opinion on how to interpret the content; a recipient can
use any Markdown workflow, including a workflow of the recipient's
choice, or a workflow inferred from the broader context (e.g., a build
script for a group of Markdown files).

***

processor: The processor parameter identifies a specific Markdown
implementation and the arguments to be fed to the processor. The
processor parameter has three sub-parameters:
1. Processor name. This is the common-sense, unambiguous name of the
processor. For example, John Gruber's implementation would be called
"Markdown.pl"; pandoc would be called "pandoc".
(Optional) 2. Version. If specified, this is the version of the
processor tool. For example, the Markdown.pl processor could have
version 1.0.1 or 1.0.2b8.
(Optional) 3. Processor-specific arguments. If specified, these
arguments would be used with the processor. Each processor gets to
define the meaning of its arguments; processors that are not
command-line based (e.g., a C library) shall define a mapping between
the argument strings and programmatic parameters to be used when
invoking the processor.

IANA would create a sub-registry of processors. Each registry entry must
contain the processor name (identifier), the full name of the tool (if
it differs from the processor name), the authors or maintainers, and any
URL or other address at which to locate the processor tool and
documentation. Optionally, versions and processor-specific arguments can
be documented in the registry entry.

***

variants [could also be called rulesets or rules]: The variants
parameter identifies sets of rules ("rulesets") that formally specify
how to turn Markdown control characters into markup. The variants
parameter is an ordered list of rulesets. A ruleset is an identifier of
a set of rules. When multiple rulesets are included in the variants
parameter, they are stacked on top of each other. A rule that directly
contradicts a prior rule (mentioned earlier in the list) gets overruled.
The definition of a ruleset can include not only specific rules, but
also other rulesets. Therefore, there can be a ruleset whose primary
purpose is to group together several rulesets.

There is a semantic difference between an absent variants parameter, and
an empty variants parameter (variants=""). An absent variants parameter
means that the author has not expressed a preference or intent for how
to interpret particular Markdown control sequences. An empty variants
parameter means that the author intends for the Markdown rules of John
Gruber's syntax <http://daringfireball.net/projects/markdown/syntax> (as
of the publication of this document) to apply. Gruber's syntax (also
called the "baseline") leaves many cases ambiguous, contradictory, or
unsatisfactory. These gripes are inherent to Markdown's evolution, and
therefore, MUST stay as-is. That is, two different Markdown processors
can claim to conform to the baseline and produce wildly different output.

Examples of variants: the extensions included in pandoc such as
"line_blocks", "fenced_code_blocks", and "strict".

IANA would create a sub-registry of rulesets for the variants parameter.
Each registry entry must include the ruleset identifier, a formal
description of the rules, and identification of included rulesets.
Optionally the entry may describe processors (including versions and
arguments) that are known to implement the ruleset.

Each ruleset identifier shall uniquely identify that set of rules. I.e.,
if "fenced_code_blocks" is registered, "guarded_code_blocks" cannot be
registered if the effective rules in "guarded_code_blocks" are the same
as "fenced_code_blocks".

***

When both variants and processor are present, processor takes
precedence. I.e., the processor choice is considered the best expression
of the author's intent.
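
To illustrate (the exact parameter syntax is deliberately left out of this
post, so the serialization below is invented purely for the sake of
example):

    Content-Type: text/markdown; charset=UTF-8; processor="Markdown.pl 1.0.1"
    Content-Type: text/markdown; charset=UTF-8; variants="fenced_code_blocks footnotes"
    Content-Type: text/markdown; charset=UTF-8

The last form expresses no opinion: the recipient may use any Markdown
workflow it likes.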

Comments welcome.

-Sean
Fletcher T. Penney
2014-07-15 12:58:08 UTC
Permalink
Not to be a wet blanket, but this feels like a solution in search of a problem to me. Maybe I just don't understand it.

Who will be putting in the effort required to make something practical happen out of all the work you put into creating the specification?

As a user, I go to a website. I click on a link and it sends me a text file written in Markdown or something descended from it. I was expecting HTML because I like "pretty". No worries, the beauty of Markdown is that it makes perfect sense as is. If I want to view it as HTML, then most likely the person creating the website already did that for me. But if not, no problem, I just process it with my preferred flavor of Markdown. Hell, on a Mac you just save it as a file with the ".md" extension, or something similar, and you can preview it right in the Finder if you install a QuickLook generator or use an editor that comes with one built in. I don't anticipate many people installing lots of Markdown variants just so they can use the same one the author used.

But even if they did, what web browser developer is going to support it? Do we envision the Chrome/Firefox/whatever teams bundling 30 different Markdown processors inside their apps so they can accurately preview text as HTML, when a key feature of the Markdown text format is that it looks pretty good all by itself?

I guess I'm not clear on what the end result of all this is going to be, in real-life practical terms. I see why getting "text/markdown" might be nice, but I'm not sure where it would have an actual impact. Perhaps some day it might. But it seems much harder to envision that it will ever be useful to the average Markdown user to get back a bunch of information about variant, version, command line arguments, etc.

That said, if a bunch of people want to spend a lot of time creating a specification that ends up not being used for very much, who am I to stop them? And again, maybe I'm wrong and this will in fact be the Next Big Thing™..


F-
--
Fletcher T. Penney
Post by Sean Leonard
Thank you all for the informative feedback and comments.
Let me get to the punchline. Now having a much better understanding of the extraordinary diversity of Markdown expressions that are out there, I think that the "flavor" parameter does not make sense. Instead let me introduce proposal rev 2, which includes two optional parameters: variants and processor. (This is very similar to Carl Jacobsen's proposal--the main differences being that I am adding more formality.)
[This is not specification text, but something like it might appear in draft -01. For the sake of this post, I am avoiding explicit discussion of syntax.]
Parameters are defined in RFC 6838 as "companion data", that is, data that assists with the meaning or interpretation. Parameters can be "advisory" (derived from the content--thus allowing a consumer to avoid parsing the content), "tangential" (informational but not affecting the interpretation of the content), or "material" (has a material effect on how the content is interpreted). In the case of Markdown, the processor and variants parameters are material in that they reflect the author's intent on how best to interpret the content. If absent, the author expresses no opinion on how to interpret the content; a recipient can use any Markdown workflow, including a workflow of the recipient's choice, or a workflow inferred from the broader context (e.g., a build script for a group of Markdown files).
***
1. Processor name. This is the common-sense, unambiguous name of the processor. For example, John Gruber's implementation would be called "Markdown.pl"; pandoc would be called "pandoc".
(Optional) 2. Version. If specified, this is the version of the processor tool. For example, the Markdown.pl processor could have version 1.0.1 or 1.0.2b8.
(Optional) 3. Processor-specific arguments. If specified, these arguments would be used with the processor. Each processor gets to define the meaning of its arguments; processors that are not command-line based (e.g., a C library) shall define a mapping between the argument strings and programmatic parameters to be used when invoking the processor.
IANA would create a sub-registry of processors. Each registry entry must contain the processor name (identifier), the full name of the tool (if it differs from the processor name), the authors or maintainers, and any URL or other address at which to locate the processor tool and documentation. Optionally, versions and processor-specific arguments can be documented in the registry entry.
***
variants [could also be called rulesets or rules]: The variants parameter identifies sets of rules ("rulesets") that formally specify how to turn Markdown control characters into markup. The variants parameter is an ordered list of rulesets. A ruleset is an identifier of a set of rules. When multiple rulesets are included in the variants parameter, they are stacked on top of each other. A rule that directly contradicts a prior rule (mentioned earlier in the list) gets overruled. The definition of a ruleset can include not only specific rules, but also other rulesets. Therefore, there can be a ruleset whose primary purpose is to group together several rulesets.
There is a semantic difference between an absent variants parameter, and an empty variants parameter (variants=""). An absent variants parameter means that the author has not expressed a preference or intent for how to interpret particular Markdown control sequences. An empty variants parameter means that the author intends for the Markdown rules of John Gruber's syntax <http://daringfireball.net/projects/markdown/syntax> (as of the publication of this document) to apply. Gruber's syntax (also called the "baseline") leaves many cases ambiguous, contradictory, or unsatisfactory. These gripes are inherent to Markdown's evolution, and therefore, MUST stay as-is. That is, two different Markdown processors can claim to conform to the baseline and produce wildly different output.
Examples of variants: the extensions included in pandoc such as "line_blocks", "fenced_code_blocks", and "strict".
IANA would create a sub-registry of rulesets for the variants parameter. Each registry entry must include the ruleset identifier, a formal description of the rules, and identification of included rulesets. Optionally the entry may describe processors (including versions and arguments) that are known to implement the ruleset.
Each ruleset identifier shall uniquely identify that set of rules. I.e., if "fenced_code_blocks" is registered, "guarded_code_blocks" cannot be registered if the effective rules in "guarded_code_blocks" are the same as "fenced_code_blocks".
***
When both variants and processor are present, processor takes precedence. I.e., the processor choice is considered the best expression of the author's intent.
Comments welcome.
-Sean
Michel Fortin
2014-07-15 12:59:10 UTC
Permalink
Post by Sean Leonard
IANA would create a sub-registry of processors. Each registry entry must contain the processor name (identifier), the full name of the tool (if it differs from the processor name), the authors or maintainers, and any URL or other address at which to locate the processor tool and documentation. Optionally, versions and processor-specific arguments can be documented in the registry entry.
...
Post by Sean Leonard
IANA would create a sub-registry of rulesets for the variants parameter. Each registry entry must include the ruleset identifier, a formal description of the rules, and identification of included rulesets. Optionally the entry may describe processors (including versions and arguments) that are known to implement the ruleset.
Each ruleset identifier shall uniquely identify that set of rules. I.e., if "fenced_code_blocks" is registered, "guarded_code_blocks" cannot be registered if the effective rules in "guarded_code_blocks" are the same as "fenced_code_blocks".
But how does a document get annotated with the attributes in the first place? Who chooses the processor and variant attributes of a document and based on what? And where is it stored? Do you have any specific example of how that could work in any given setup?

My impression is that all this is going to do is define some metadata flags that no one will use.

What is the goal here? Is the goal to have most Markdown documents on the internet be annotated in this way so some browser software can pick automatically a sort-of compatible implementation for a given document? Or is it a way to have inside a given system (a CMS for instance) a way to annotate which Markdown implementation to use internally to parse a specific document?
--
Michel Fortin
***@michelf.ca
http://michelf.ca
Sean Leonard
2014-07-15 13:26:14 UTC
Permalink
Post by Michel Fortin
Post by Sean Leonard
IANA would create a sub-registry of processors. Each registry entry must contain the processor name (identifier), the full name of the tool (if it differs from the processor name), the authors or maintainers, and any URL or other address at which to locate the processor tool and documentation. Optionally, versions and processor-specific arguments can be documented in the registry entry.
...
Post by Sean Leonard
IANA would create a sub-registry of rulesets for the variants parameter. Each registry entry must include the ruleset identifier, a formal description of the rules, and identification of included rulesets. Optionally the entry may describe processors (including versions and arguments) that are known to implement the ruleset.
Each ruleset identifier shall uniquely identify that set of rules. I.e., if "fenced_code_blocks" is registered, "guarded_code_blocks" cannot be registered if the effective rules in "guarded_code_blocks" are the same as "fenced_code_blocks".
But how does a document get annotated with the attributes in the first place? Who chooses the processor and variant attributes of a document and based on what? And where is it stored? Do you have any specific example of how that could work in any given setup?
I am working on all of that.

The author chooses the processor and variant attributes; or, the
author's editing software will do this for the author. For example, a
tool like MarkdownPad can save out this metadata in the "right place". I
put it in quotes because I know that is an issue. One thing obvious
(from the metadata sub-thread) is that it cannot be stored in a generic
Markdown file in a broadly compatible way--I am thinking of something
adjacent.

If it is in a version control system like Subversion, or a CMS, then it
could be stored in the properties/attributes. If it is in an e-mail (in
particular, an e-mail generated by a CMS, see below), then it can be
stored in the usual MIME way.

I am trying not to invent another metadata format, so I am still looking
at the existing options out there.
Post by Michel Fortin
My impression is that all this is going to do is define some metadata flags that no one will use.
What is the goal here? Is the goal to have most Markdown documents on the internet be annotated in this way so some browser software can pick automatically a sort-of compatible implementation for a given document? Or is it a way to have inside a given system (a CMS for instance) a way to annotate which Markdown implementation to use internally to parse a specific document?
Definitely the latter--for a system like a CMS to store the Markdown
content with metadata, so that it can parse a specific document in a
specific way. Perhaps more importantly than storage, it is meant for
interchange--like when you export content from one CMS to another CMS.
Presumably, most CMSes will use one parser for their public-facing
implementation. In that case the parameters are implied. But when you
export data from that CMS (and import it into another CMS), it would be
very useful to record what Markdown features were used, so that the new
CMS can interpret the data the way the original users intended it to be
understood.

For example, take fenced code blocks. Your old CMS supported fenced code
blocks; the new one does not (for security reasons or because it's not
germane to the purpose of the CMS). Or maybe your old CMS supported
Fancy Tables Type #1 and the new one supports Fancy Tables Type #2.
Well, when you import your data into the new CMS, the new CMS can see
that its preferred Markdown processor is going to mangle the content, so
as part of the import process, it invokes the Markdown processor in the
metadata, converting the fenced code blocks to HTML (or the Fancy Tables
Type #1 to HTML). Then the content is going to look as the users
intended, but you don't have to maintain two contradictory
implementations in the new CMS. The Markdown processor for the imported
data can be invoked "offline" (i.e., as part of the bulk-import
process). This also alleviates security concerns since the import
process can be operated in another VM.
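
Very roughly, and purely as a hypothetical sketch (the mapping from the
recorded metadata to a concrete command line is exactly the kind of thing
the registry entries would have to pin down), such an import hook could
look like:

    import subprocess

    # Hypothetical bulk-import step: the processor recorded in the document's
    # metadata is invoked once, offline, so that constructs the new CMS does
    # not support (e.g. fenced code blocks) are pre-rendered to HTML.
    def import_document(markdown_text, recorded_cmd=("pandoc", "-f", "markdown", "-t", "html")):
        result = subprocess.run(recorded_cmd, input=markdown_text,
                                capture_output=True, text=True, check=True)
        return result.stdout   # stored as HTML in the new CMS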

Sean
Fletcher T. Penney
2014-07-15 14:40:21 UTC
Permalink
Post by Sean Leonard
Post by Michel Fortin
What is the goal here? Is the goal to have most Markdown documents on
the internet be annotated in this way so some browser software can
pick automatically a sort-of compatible implementation for a given
document? Or is it a way to have inside a given system (a CMS for
instance) a way to annotate which Markdown implementation to use
internally to parse a specific document?
Definitely the latter--for a system like a CMS to store the Markdown
content with metadata, so that it can parse a specific document in a
specific way. Perhaps more importantly than storage, it is meant for
interchange--like when you export content from one CMS to another CMS.
Presumably, most CMSes will use one parser for its (public)-facing
implementation. In that case the parameters are implied. But when you
export data from that CMS (and import it into another CMS), it would be
very useful to record what Markdown features were used, so that the new
CMS can interpret the data the ways in which the original users intended
for it to be understood.
It seems, then, that the place to focus effort is to create a CMS that
is "multilingual" when it comes to Markdown variants.

My suspicion is that no current CMS developer is going to go out of
their way to modify their system so that it can automatically support
multiple flavors of Markdown. Nor do I think that is a particularly
useful thing for them to do.

Instead, a good CMS should allow the administrator to change which
processor is used to convert raw text into HTML. When I experimented
with Movable Type, I had to create a tool to allow me to use
MultiMarkdown (for obvious reasons that is my preferred variant...). It
was important to me, so I did it. The only thing I felt that MT was
"responsible" for was to allow me to customize this.

I'm left to believe that if you want such a CMS, it will be up to you to
create it, or to modify an existing CMS to support this. You could
presumably create a MT plugin that reads a central preference, and
"calls out" to the specified Markdown processor of your choice to handle
formatting.

I don't think it's realistic to expect that all, or even a significant
minority, of available CMS packages are going to support this. Your
best bet is to implement this "bottom-up" by creating plugins to manage
this on your own. I don't see where any sort of an internet "standard"
is necessary.



IMHO, a much more useful standardization effort is to continue to
develop test suites with as many edge cases as possible to help with
consistency in how the various flavors handle "standard" Markdown syntax
structures. As I've suggested in the past, I think it would be useful
for there to be agreement on:

* "Standard" Markdown -- the features in Markdown.pl, with appropriate
bug fixes

* "Common extended" Markdown (or whatever it would have to be called) --
features that are commonly added that are easily standardized (e.g.
footnotes, not using underscores in the middle of words, whatever). The
list of features in this category should be quite short.

* Everything else --- you want a symbol for inserting pink bunnies? Go
for it. But no one else is likely to follow suit.

This would allow users to know whether a given flavor passes either the
"standard certification" or "extended certification" (change the
terminology as you see fit).

MultiMarkdown, for example, has a test suite that continues to grow as
various edge cases are discovered -- I test both the Markdown
compatibility mode, and the full MultiMarkdown feature set. Karl posted
about his [test suite][1] on GitHub, and I patched MMD to pass the one
test in his suite that it failed.


I suspect most users would be fine using any flavor of Markdown that
passed the "standard" compatibility. The majority of the remainder
would be happy with any flavor passing "extended" compatibility. Any
one else is going to understand enough about Markdown to pick their own
flavor and handle any issues that come up.


FTP


[1]: https://github.com/karlcow/markdown-testsuite
--
Fletcher T. Penney
***@fletcherpenney.net
Dennis E. Hamilton
2014-07-15 15:38:02 UTC
Permalink
Concerning the problem that the MIME type is known to storage systems and conveyed in HTTP responses, but not carried in anything like attached metadata, I believe the generic solution is known as #!. The nice thing about having processor and variants separate from the generic name, as in #!md [more-stuff]..., is that it sidesteps a long-standing problem where the first term is also used as an application-program association. In this case, whatever processor is picked up at that stage can either process the [more-stuff]... or not.

This sort of thing can get weighty, so one might expect an md processor to treat immediately-following #! lines as continuations of the first one.
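
Purely as an illustration of the shape (no particular syntax is being
proposed here), the first lines of a file might look like:

    #!md processor="Markdown.pl 1.0.1"
    #! variants="fenced_code_blocks footnotes"

    The document's actual Markdown content would start here.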


-- Dennis E. Hamilton
***@acm.org +1-206-779-9430
https://keybase.io/orcmid PGP F96E 89FF D456 628A
X.509 certs used and requested for signed e-mail

(Yes, I have been thinking about this a great deal, although I was thinking in terms of wikiTexts, of which md is a flavor. If you federate wikis and transclude content, this sort of thing becomes important.)



-----Original Message-----
From: Markdown-Discuss [mailto:markdown-discuss-***@six.pairlist.net] On Behalf Of Sean Leonard
Sent: Tuesday, July 15, 2014 06:26
To: markdown-***@six.pairlist.net
Subject: Re: Punchline: variants and processor (text/markdown)

On 7/15/2014 5:59 AM, Michel Fortin wrote:
[ ... ]
Post by Michel Fortin
But how does a document get annotated with the attributes in the first place? Who chooses the processor and variant attributes of a document and based on what? And where is it stored? Do you have any specific example of how that could work in any given setup?
I am working on all of that.

The author chooses the processor and variant attributes; or, the
author's editing software will do this for the author. For example, a
tool like MarkdownPad can save out this metadata in the "right place". I
put it in quotes because I know that is an issue. One thing obvious
(from the metadata sub-thread) is that it cannot be stored in a generic
Markdown file in a broadly compatible way--I am thinking of something
adjacent.

If it is in a version control system like Subversion, or a CMS, then it
could be stored in the properties/attributes. If it is in an e-mail (in
particular, an e-mail generated by a CMS, see below), then it can be
stored in the usual MIME way.

I am trying not to invent another metadata format, so I am still looking
at the existing options out there.

[ ... ]

Sean Leonard
2014-07-13 00:26:38 UTC
Permalink
I think I can move on to my next question:

It seems that all Markdown content is expected to appear inside of a
block-level element in HTML parlance; i.e., inside <body> or one of its
block-level descendants (<div>, <p>, <td>, <form>, <h1>...<h6>, etc.).

I tried to do some <head> stuff, as in:
http://johnmacfarlane.net/babelmark2/?text=%3Chead%3E%3Ctitle%3EHello+World%3C%2Ftitle%3E%3Cmeta+name%3D%22author%22+content%3D%22Alice%22%3E%3C%2Fhead%3E%0A%0AI+am+some+text.%0A%3Cdiv%3Eand+i+am+inside+*myself*%3C%2Fdiv%3E%0A%0AThe+end.

And not surprisingly, the results are all over the place. Clearly this
is not an effective way to communicate HTML metadata, since Markdown is
designed to process HTML block-level content.

Therefore, *when it matters*, what are strategies that Markdown users
currently use to manage HTML metadata such as those metadata items
defined in <http://www.w3.org/TR/html5/document-metadata.html> and
<http://www.w3.org/TR/html401/struct/global.html#h-7.4>?

I am interested in items such as:
title
meta name info (author, generator, description, keywords)
link rel (stylesheet, icon, etc.)
language (either http-equiv content-language, or <html lang="XX">)
date [not part of HTML, but see pandoc_title_block]
?

I recognize that in many use cases, Markdown is for content fragments:
stick this blob of text somewhere in a page and be done with it. But
increasingly there are Markdown files (.md, .markdown) that are being
treated as discrete documents. So for those latter cases, some metadata
is desirable.

Are the following also true (or aesthetically agreeable)?
- there are no concerted CROSS-TOOL efforts to insert metadata into
Markdown streams
(I am aware of pandoc_title_block)
- inserting metadata into Markdown streams in a CROSS-TOOL way would be
kludgey
e.g. use an inert comment at the top:
[/Title/]: # (This comment could include metadata)
(but nobody does this)

-Sean
Karl Dubost
2014-07-13 00:52:15 UTC
Permalink
Therefore, *when it matters*, what are strategies that Markdown users currently use to manage HTML metadata such as those metadata items defined in
search for multi-markdown.
http://fletcher.github.io/MultiMarkdown-4/metadata
--
Karl Dubost 🐄
http://www.la-grange.net/karl/
Shane McCarron
2014-07-13 01:03:07 UTC
Permalink
We did some work on accessible markdown a year ago, adding RDFa and ARIA
markup to help add metadata to the content. I don't think I have any good
pointers right now, but it was all about making sure the generated HTML was
WCAG compliant and semantically meaningful.
Post by Sean Leonard
Therefore, *when it matters*, what are strategies that Markdown users
currently use to manage HTML metadata such as those metadata items defined
in
search for multi-markdown.
http://fletcher.github.io/MultiMarkdown-4/metadata
--
Karl Dubost 🐄
http://www.la-grange.net/karl/
Waylan Limberg
2014-07-13 01:27:45 UTC
Permalink
Post by Karl Dubost
Therefore, *when it matters*, what are strategies that Markdown users currently use to manage HTML metadata such as those metadata items defined in
search for multi-markdown.
http://fletcher.github.io/MultiMarkdown-4/metadata
Yes, that is one example. A few other implementations have similar extensions. However, I think the best example is Jekyll [1], the static site generator behind GitHub Pages (admittedly, Jekyll is not a markdown parser, but a tool that uses one). Although its metadata syntax is not really that much different from the other metadata extensions, it is important to note that Jekyll supports more than one text format (markdown, textile). Behind the scenes, the code removes the "frontmatter" first (which it passes on to a YAML parser), then passes the remaining text on to the appropriate parser. The point is that the one file contains two documents: a YAML document and a markdown document, each parsed by a separate tool. So, while other markdown parsers may parse the frontmatter with the same tool, I still think of the metadata as being something other than markdown.
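
Roughly, such a file looks like the following (illustrative only; see the
Jekyll documentation linked below for the exact rules):

    ---
    title: My Post
    author: Alice
    layout: post
    ---

    The *markdown* body starts after the closing --- line.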

I should also point out that a number of projects will use the first <h1> header in the document as the title. And if the file is stored on the file system, the creation and modification dates may be pulled from the file system. Some even use the file name for the title (converting underscores to spaces and title-casing). But those are the least flexible systems. The most flexible systems generally store the metadata in separate columns in a database alongside the markdown.

One thing is for certain, there is absolutely no standardization regarding metadata associated with markdown documents and many (most?) parsers do nothing to address the issue.

IMO, pure markdown is just human-readable HTML fragments. That, I guess, is part of the reason why I asked why we need a MIME type way back in my first response. Those HTML fragments don't really stand on their own, so why would a pure markdown file be transported on its own outside of some container that contains all that other metadata? Especially when that container already has a MIME type of its own.

[1]: http://jekyllrb.com/docs/frontmatter/

Waylan
Fletcher T. Penney
2014-07-13 01:45:30 UTC
Permalink
These are some of the things that led me to release MultiMarkdown 9 years ago:

* The realization that Markdown documents could be complete documents, and not just a snippet of text to be inserted in a blog CMS

* That these complete documents would need some sort of metadata (Gruber was not a fan of this idea)

* That Markdown could be converted to more than just HTML (e.g. LaTeX, etc.)


The MultiMarkdown metadata syntax was based on a blosxom plugin (I believe it was simply called meta??)

I would recommend checking out MMD (in addition to pandoc as you mentioned) if you're interested in Markdown related tools that support metadata.
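
For reference, MMD metadata is a block of simple key-value pairs at the
very top of the document, ending at the first blank line; roughly:

    Title:   A Sample MultiMarkdown Document
    Author:  Alice Author
    Date:    July 13, 2014

    The body begins after the blank line.

(See the MultiMarkdown documentation for the exact rules.)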



FTP
--
Fletcher T. Penney
It seems that all Markdown content is expected to appear inside of a block-level element in HTML parlance; i.e., inside <body> or one of its block-level descendants (<div>, <p>, <td>, <form>, <h1>...<h6>, etc.).
<snip>
Therefore, *when it matters*, what are strategies that Markdown users currently use to manage HTML metadata such as those metadata items defined in <http://www.w3.org/TR/html5/document-metadata.html> and <http://www.w3.org/TR/html401/struct/global.html#h-7.4>?
title
meta name info (author, generator, description, keywords)
link rel (stylesheet, icon, etc.)
language (either http-equiv content-language, or <html lang="XX">)
date [not part of HTML, but see pandoc_title_block]
?
I recognize that in many use cases, Markdown is for content fragments: stick this blob of text somewhere in a page and be done with it. But increasingly there are Markdown files (.md, .markdown) that are being treated as discrete documents. So for those latter cases, some metadata is desirable.
Are the following also true (or aesthetically agreeable)?
- there are no concerted CROSS-TOOL efforts to insert metadata into Markdown streams
(I am aware of pandoc_title_block)
- inserting metadata into Markdown streams in a CROSS-TOOL way would be kludgey
[/Title/]: # (This comment could include metadata)
(but nobody does this)
-Sean
John MacFarlane
2014-07-13 06:21:16 UTC
Permalink
Post by Sean Leonard
It seems that all Markdown content is expected to appear inside of a
block-level element in HTML parlance; i.e., inside <body> or one of
its block-level descendants (<div>, <p>, <td>, <form>, <h1>...<h6>,
etc.).
http://johnmacfarlane.net/babelmark2/?text=%3Chead%3E%3Ctitle%3EHello+World%3C%2Ftitle%3E%3Cmeta+name%3D%22author%22+content%3D%22Alice%22%3E%3C%2Fhead%3E%0A%0AI+am+some+text.%0A%3Cdiv%3Eand+i+am+inside+*myself*%3C%2Fdiv%3E%0A%0AThe+end.
And not surprisingly, the results are all over the place. Clearly this
is not an effective way to communicate HTML metadata, since Markdown
is designed to process HTML block-level content.
Therefore, *when it matters*, what are strategies that Markdown users
currently use to manage HTML metadata such as those metadata items
defined in <http://www.w3.org/TR/html5/document-metadata.html> and
<http://www.w3.org/TR/html401/struct/global.html#h-7.4>?
title
meta name info (author, generator, description, keywords)
link rel (stylesheet, icon, etc.)
language (either http-equiv content-language, or <html lang="XX">)
date [not part of HTML, but see pandoc_title_block]
?
There is no standardization here. However, pandoc has moved on to a
more flexible system allowing structured YAML metadata, which may be
placed anywhere in the document.

http://johnmacfarlane.net/pandoc/README.html#yaml-metadata-block
Sean Leonard
2014-07-11 08:26:34 UTC
Permalink
Post by Michel Fortin
I sure wish things would be simpler. But as things are now, I have a hard time identifying what "flavor" could mean. Should "Markdown.pl-1.0.1" be a flavor on its own?
Thanks Carl. I am starting to toy with this idea...I was thinking of
calling it "markwith" (for what you call "processor") and "deviations"
or "variations" (for what you--and I--call "flavor"). More on this shortly.

-Sean
Aristotle Pagaltzis
2014-07-10 03:06:34 UTC
Permalink
Post by Sean Leonard
Markdown has no way to communicate the character set in the document
(other than the Unicode Byte Order Marks, which is a generalized
property about text streams, not specific to Markdown)--and it would
be counterproductive to invent one. So that is a perfect example of
relevant metadata. And the second one, is how to turn it into
something else that the author wants. If it's not communicated, it's
going to be implied. Implied means "guessing" and likely "guessing
wrong".
Yet guessing wrong is largely without consequence.

There are really no syntax features that affect the document’s rendering
non-locally. If part of a document is written with unsupported syntax,
only that part will render incorrectly, but the other parts will come
out fine.

And there are no large overlapping surfaces among the syntaxes of the
various extensions (esp. those for very different document features),
which makes unsupported syntax unlikely to appear to have been intended
to be rendered as some completely dissimilar feature.

So you will get a document that differs from the author’s intent in some
way. But it will be clear *where* the differences are and you will still
get all of the data in *some* form, quite possibly fully intelligible if
not pretty.

And because of the primary goal of Markdown to be human-readable in its
source form, there is always an easy and cheap last resort: view source.

Bottom line, misrendering a document due to wrong choice of flavour is
annoying but inconsequential, due to the very nature of Markdown.

Therefore the flavour parameter ought to be considered nothing more than
loosely informative, and the processor should just render the document
to the best of its ability regardless of the flavour specified. It MAY
use the parameter value to adapt to the document, in RFC 2119 lingo, but
ought not be bound by it.

Furthermore, an absent flavour parameter ought to mean that the flavour
is unspecified, not that it is any particular default flavour; i.e. the
choice of flavour in that case ought to be up to the processor.

Lastly, the spec should mention (as informal guidance to implementors)
that applications containing Markdown processors which have any chance
of being exposed to source documents of unknown flavour should, if at
all possible, provide a means for the user to view the source Markdown
document in unformatted form.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Karl Dubost
2014-07-10 04:00:46 UTC
Permalink
Aristotle,
Post by Aristotle Pagaltzis
Lastly, the spec should mention (as informal guidance to implementors)
that applications containing Markdown processors which have any chance
of being exposed to source documents of unknown flavour should, if at
all possible, provide a means for the user to view the source Markdown
document in unformatted form.
Very good point. Added to
https://github.com/karlcow/markdown-testsuite/issues/59
--
Karl Dubost 🐄
http://www.la-grange.net/karl/
Sean Leonard
2014-07-10 04:25:44 UTC
Permalink
Post by Aristotle Pagaltzis
Post by Sean Leonard
Markdown has no way to communicate the character set in the document
(other than the Unicode Byte Order Marks, which is a generalized
property about text streams, not specific to Markdown)--and it would
be counterproductive to invent one. So that is a perfect example of
relevant metadata. And the second one, is how to turn it into
something else that the author wants. If it's not communicated, it's
going to be implied. Implied means "guessing" and likely "guessing
wrong".
Yet guessing wrong is largely without consequence.
There are really no syntax features that affect the document’s rendering
non-locally. If part of a document is written with unsupported syntax,
only that part will render incorrectly, but the other parts will come
out fine.
There are two use cases that I am particularly interested in:
#1 You put .md files in a project (readme.md, etc.). These .md files are
then passed around among project users, which may include developers,
copy-writers, copy-editors, etc. They need to be sure that the readme.md
is treated in the same way, which ought to be communicated with the
data. If one person edits the document in UTF-8 and commits and another
person edits the document in ISO-8859-1, you're going to have problems.

#2 You have some app (let's say some web forum for example, but it
literally could be anything, an electronic health record, some national
criminal records, whatever) and you export data from the app. Say to
some structured data format like XML or a sqlite database. Part of data
liberation or backup or whatever. You want to get whatever your users
actually input into the fields--not the HTMLized versions. So you need
to annotate the blobs of data as Markdown, since users like to upload
various kinds of data (Word docs, JPEG images, MP4 videos, bits of text
like names of individuals, whatever).

In both cases, rendering matters "non-locally".
Post by Aristotle Pagaltzis
And there are no large overlapping surfaces among the syntaxes of the
various extensions (esp. those for very different document features),
which makes unsupported syntax unlikely to appear to have been intended
to be rendered as some completely dissimilar feature.
As someone new to Markdown development, I really want to see some
comprehensive references (since "authority" in Markdown-land is notably
absent). Besides, since Markdown is such a free-for-all, someone could
easily write a Markdown processor that turns (!) into
<script>alert('hello!');</script>.
Post by Aristotle Pagaltzis
So you will get a document that differs from the author’s intent in some
way. But it will be clear *where* the differences are and you will still
get all of the data in *some* form, quite possibly fully intelligible if
not pretty.
For what we might call "sensible flavors" of Markdown, yes. But the
author's intent may be poorly represented when processed through a tool
that injects lolcat pictures every third word. Or, the author's intent
may be very well-represented.

The point is...we don't know what the author's intent is, /unless the
author tells us/. And I think we need some more metadata to make the
author's intent clear.
Post by Aristotle Pagaltzis
And because of the primary goal of Markdown to be human-readable in its
source form, there is always an easy and cheap last resort: view source.
This is a goal. Agreed.
Post by Aristotle Pagaltzis
Therefore the flavour parameter ought to be considered nothing more than
loosely informative, and the processor should just render the document
to the best of its ability regardless of the flavour specified. It MAY
use the parameter value to adapt to the document, in RFC 2119 lingo, but
ought not be bound by it.
I would reword this:

The flavor parameter informs recipients of the author's intent. The
processor should just render the document to the best of its ability
regardless of the flavor specified. It SHOULD use the parameter value
to adapt to the document.



I don't know what should happen if the flavor is absent. I am trying to
understand. Let me put it this way: if you come across un-annotated
Markdown in the wild (as in, not attached to any processing scripts,
instructions, directions, whatever), what do you do? "Guess?"
Post by Aristotle Pagaltzis
Furthermore, an absent flavour parameter ought to mean that the flavour
is unspecified, not that it is any particular default flavour; i.e. the
choice of flavour in that case ought to be up to the processor.
The choice of how to act on the Markdown is /always/ up to the
processor...so...probably. It just may not represent the author's intent.

Between this and the Gruber discussion, I need to get used to this idea
that "guessing" is a normative part of Markdown culture. :)
Post by Aristotle Pagaltzis
Lastly, the spec should mention (as informal guidance to implementors)
that applications containing Markdown processors which have any chance
of being exposed to source documents of unknown flavour should, if at
all possible, provide a means for the user to view the source Markdown
document in unformatted form.
Agreed on that one. I will include something like that in the next draft.

-Sean
Aristotle Pagaltzis
2014-07-10 13:30:10 UTC
Permalink
Post by Aristotle Pagaltzis
Yet guessing wrong is largely without consequence.
There are really no syntax features that affect the document’s
rendering non-locally. If part of a document is written with
unsupported syntax, only that part will render incorrectly, but the
other parts will come out fine.
There are two use cases that I am particularly interested in: #1 You
put .md files in a project (readme.md, etc.). These .md files are then
passed around among project users, which may include developers,
copy-writers, copy-editors, etc. They need to be sure that the
readme.md is treated in the same way, which ought to be communicated
with the data. If one person edits the document in UTF-8 and commits
and another person edits the document in ISO-8859-1, you're going to
have problems.
#2 You have some app (let's say some web forum for example, but it
literally could be anything, an electronic health record, some
national criminal records, whatever) and you export data from the app.
Say to some structured data format like XML or a sqlite database. Part
of data liberation or backup or whatever. You want to get whatever
your users actually input into the fields--not the HTMLized versions.
So you need to annotate the blobs of data as Markdown, since users
like to upload various kinds of data (Word docs, JPEG images, MP4
videos, bits of text like names of individuals, whatever).
In both cases, rendering matters "non-locally".
I’m afraid you entirely misunderstood what I meant by non-local.

What I was referring to is, e.g. in HTML you can insert tags at the top
of a document such as `<table>` or `<pre>` which then change the way the
entire remaining document is to be rendered. They affect the document
non-locally.

Markdown does not have such constructs. If you include a Markdown Extra
table in the document and you put that document through Markdown.pl, you
will get a garbled form of the source of the table syntax as output for
the table, but the misrendering is only local. The rest of the document
will be unaffected and will render correctly.
As someone new to Markdown development, I really want to see some
comprehensive references (since "authority" in Markdown-land is notably
absent).
I’m afraid you will have to first find and then survey all of the
processors yourself. The closest there is to central coordination is
discussion on this list, but it’s more of a users list, which a lot of
implementors seem to shun (partly or fully) and others are unaware of.
Besides, since Markdown is such a free-for-all, someone could easily
write a Markdown processor that turns (!) into
<script>alert('hello!');</script>.
Sure, someone could, but who would use it? There is no point in basing
any technical considerations on this.
Post by Aristotle Pagaltzis
So you will get a document that differs from the author’s intent in
some way. But it will be clear *where* the differences are and you
will still get all of the data in *some* form, quite possibly fully
intelligible if not pretty.
For what we might call "sensible flavors" of Markdown, yes. But the
author's intent may be poorly represented when processed through
a tool that injects lolcat pictures every third word. Or, the author's
intent may be very well-represented.
It makes no sense to me to consider obviously silly pseudo-flavours just
because anything can claim to implement Markdown. What author is going
to write a real document using such a processor, and what user is going
to try and read Markdown documents with it?
The point is...we don't know what the author's intent is, /unless the
author tells us/.
And he has: he said it’s Markdown. It may not be entirely clear which
flavour, but that alone is a lot more than nothing. Now he sure should
be able to explain himself more specifically than that, but the user is
not dependent on more detail to make reasonable sense of the document.
Post by Aristotle Pagaltzis
Therefore the flavour parameter ought to be considered nothing more
than loosely informative, and the processor should just render the
document to the best of its ability regardless of the flavour
specified. It MAY use the parameter value to adapt to the document,
in RFC 2119 lingo, but ought not be bound by it.
The flavor parameter informs recipients of the author's intent. The
processor should just render the document to the best of its ability
regardless of the flavor specified. It SHOULD use the parameter value
to adapt to the document.
MAY, MUST or bust. SHOULD is almost automatically a bad idea and should be
employed very sparingly (though should also not be shied away from when
warranted).

Note that RFC 2119 “SHOULD” is not the same as English “should”.
I don't know what should happen if the flavor is absent. I am trying
to understand. Let me put it this way: if you come across un-annotated
Markdown in the wild (as in, not attached to any processing scripts,
instructions, directions, whatever), what do you do? "Guess?"
Yes! That was what my entire mail was saying: you just guess. And if you
guess wrong, nothing much happens. The result looks a little ugly and
the user goes View Source and end of story. And that’s if they cannot
decipher the intended meaning at all.
Post by Aristotle Pagaltzis
Furthermore, an absent flavour parameter ought to mean that the
flavour is unspecified, not that it is any particular default
flavour; i.e. the choice of flavour in that case ought to be up to
the processor.
The choice of how to act on the Markdown is /always/ up to the
processor...so...probably. It just may not represent the author's intent.
Between this and the Gruber discussion, I need to get used to this
idea that "guessing" is a normative part of Markdown culture. :)
The thing is, Markdown is not terribly hard to process, and it’s easy to
support extra syntax or change the interpretation of things slightly.

Part of it is even Gruber himself; his last release is a beta with some
small differences in syntax from the previous stable release, which he
never superseded. Furthermore, he has agreed with certain proposed
tweaks, such as forbidding intra-word underscore emphasis, which he never
got around to implementing himself but which have been adopted elsewhere.

So naturally a lot of people have taken it and run with it, in all sorts
of directions. The fact that it accommodates this (while simple uses of
basic features work the same everywhere) is part of the appeal. The core
features are well picked and well designed, so they are attractive
to take as a basis for anyone who wants to design a nice human-readable
shorthand syntax – no need to go through all the basics, just spec out
the one other thing you need and implement it. By calling your own thing
Markdown+extensions you get to profit at least partially from a lot of
software that already exists.

Of course the result is a highly informal and highly fractured landscape
where no two implementations agree on every edge case use of the syntax
and any given syntax extension likely has only a single implementation.
Trying to put this in any order is not going to be easy, if it is even
possible.

But actually I don’t know that Markdown would have been as successful as
it is if it were more strongly formalised. That it makes an attractive
platform for one’s own extensions is probably why it has spread so much:
people extending it do so with their own extensions as their goal, but
thereby implicitly help the core Markdown feature set reproduce itself
in another implementation.

One might say that Markdown is a highly virulent meme, in the original
Dawkins sense of the word.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Fletcher T. Penney
2014-07-10 16:27:14 UTC
Permalink
This isn't entirely true. For example, try to insert a list immediately
preceding an indented (not fenced) code block, where the code block is
*not* part of the list. By doing so, you convert the code block into a
paragraph that becomes part of the last list item. Indenting with one
more tab results in a code block, but that code block is still part of
the list. This seems to be one of the first "gotchas" that trips up many
users new to Markdown.

I suppose it's debatable as to how "non-local" that effect is
considered, since the code block immediately follows the list. But
inserting one structure (the list) breaks something that previously
worked (the code block).

A less debatable example is fenced code blocks (not part of Markdown per
se, but part of many derivatives). I was reluctant to include fenced
code blocks in MultiMarkdown for just this reason. To my chagrin, after
adding fenced code block support to MMD, I then realized just how much
of a bad idea they are. They are the one syntax element you can't
identify by simply looking local to that portion of the text. You have
to start all the way at the beginning, which can be a headache if you
have a novel contained in a single text file.


Consider the following excerpt from a document:

```

This is *text*

```

This is more *text*

```

Without knowing the entire document before the "snippet", one cannot
determine which of those sentences is a code block, and which is a
paragraph. The first row of backticks could be starting a new code
block, or it could be closing a previous code block. One accidental
fence delimiter at the beginning of a document could alter the meaning
of the entire thing. It's basically an "even/odd" problem, and the only
way to know if a particular fence is even or odd is to count all the
preceding fences.
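
To make that even/odd point concrete, here is a minimal sketch (in
Python, purely illustrative and not taken from any particular
implementation) of the whole-document scan a processor needs just to
classify each line:

```
# Decide, for every line, whether it falls inside a fenced code block.
# The only way to know is to keep a running count of preceding fences.
FENCE = "`" * 3   # a literal triple backtick, spelled out so that this
                  # example does not trip over its own fence

def inside_fence_flags(lines):
    inside = False
    flags = []
    for line in lines:
        if line.lstrip().startswith(FENCE):
            inside = not inside      # each fence toggles the state
        flags.append(inside)
    return flags
```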

This may not really matter that much when a single person creates a
single document to be converted once into a single web page. But in
more complicated real world use, this can be problematic.

I don't know that there is much to do about it at the moment, and I
don't currently plan to yank fenced code blocks from MMD. But I mention
it so that it can be considered when proposing additional new features
on top of the Markdown base set.


FTP
Post by Aristotle Pagaltzis
I’m afraid you entirely misunderstood what I meant by non-local.
What I was referring to is, e.g. in HTML you can insert tags at the top
of a document such as `<table>` or `<pre>` which then change the way the
entire remaining document is to be rendered. They affect the document
non-locally.
Markdown does not have such constructs. If you include a Markdown Extra
table in the document and you put that document through Markdown.pl, you
will get a garbled form of the source of the table syntax as output for
the table, but the misrendering is only local. The rest of the document
will be unaffected and will render correctly.
--
Fletcher T. Penney
***@fletcherpenney.net
Aristotle Pagaltzis
2014-07-10 19:08:54 UTC
Permalink
Post by Fletcher T. Penney
I suppose it's debatable as to how "non-local" that effect is
considered,
I do consider that local. “Local-enough” at least. Tellingly though it’s
also the most common single reason for “how do I” user inquiries on this
list.
Post by Fletcher T. Penney
A less debatable example is fenced code blocks
Yes, true.

The annoying thing is, while I was against them when the proposal was
brought up for Markdown in general, after spending some time using them
(by way of GitHub), I do find them convenient as a user. So now I have
a dilemma.

— • —

Still though – because fenced code blocks are an extension, they don’t
much change the principle of what I said: if you use a fenced code
block, then try to render the document using some other processor which
doesn’t support them, you get a garbled code block and then the rest of
the document after that looks fine.

Conversely, documents written for processors that do not support fenced
code blocks are rather unlikely to contain something that looks as if it
were one – which would then lead to a large-scale botch if you tried to
render them using a processor that supports fenced code blocks. (I feel
this is especially so for backtick fences. Tilde fences seem to have some
remote likelihood of being used as an innocuous part of a
non-fenced-code-block document.)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Karl Dubost
2014-07-10 01:18:29 UTC
Permalink
Hi,
Post by Michel Fortin
Markdown is in the spot where HTML was before HTML5 with each implementation doing its own thing. I don't know if Markdown will get out of there anytime soon.
Yes, basically. And it's why Ciro Santilli has done amazing work on the [test suite][1] I started.

[1]: https://github.com/karlcow/markdown-testsuite

The "issue" with Markdown and its flavors is that it is mainly used:

* as an input format for something else, aka in a converter scenario,
* rather than as an exchange format with multiple emitters/consumers needing interoperability.
Post by Michel Fortin
I'll point out however that HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
Yup agreed. A [MIME type][2] is useful in the case of an "exchange format", when an emitter and a receiver need to understand what they are exchanging. In the case of the input format, there is no issue because the environment is constrained. When you play with multiple clients, the interoperability story becomes interesting.

[2]: http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00
--
Karl Dubost 🐄
http://www.la-grange.net/karl/
Fletcher T. Penney
2014-07-10 02:01:55 UTC
Permalink
You appear to be testing MultiMarkdown in standard mode, rather than compatibility mode (`-c`). If you're going to test against standard Markdown syntax, you should use compatibility mode, as it disables the additional features that alter the output (e.g. smart typography, anchors on headers, etc.).

When doing this properly, it appears that MMD fails only one test, which I will work on.

FTP
--
Fletcher T. Penney
Post by Karl Dubost
Hi,
Post by Michel Fortin
Markdown is in the spot where HTML was before HTML5 with each implementation doing its own thing. I don't know if Markdown will get out of there anytime soon.
Yes, basically. And it's why Ciro Santilli has done amazing work on the [test suite][1] I started.
[1]: https://github.com/karlcow/markdown-testsuite
* as an input format for something else, aka in a converter scenario,
* rather than as an exchange format with multiple emitters/consumers needing interoperability.
Post by Michel Fortin
I'll point out however that HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
Yup agreed. A [MIME type][2] is useful in the case of an "exchange format", when an emitter and a receiver need to understand what they are exchanging. In the case of the input format, there is no issue because the environment is constrained. When you play with multiple clients, the interoperability story becomes interesting.
[2]: http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00
--
Karl Dubost 🐄
http://www.la-grange.net/karl/
John MacFarlane
2014-07-10 05:18:37 UTC
Permalink
This seems a reasonable proposal to me. Like Michel Fortin, though,
I suspect the "flavors" part will be more trouble than it's worth.
Is there going to be a distinct flavor for every version of pandoc,
for example? What about people who use pandoc but disable one or
two of the pandoc extensions (which you can do with a command line
flag)? Your document mentions "github flavored markdown," but there
are actually two distinct github flavors, one for displaying long-form
documents like READMEs (here hard line breaks are treated as spaces,
as in original markdown), and one for issues and comments (here hard
line breaks are rendered as hard breaks). It sounds like a LOT of
work to keep the registry of flavors up to date, unless the flavors
are going to be very coarse-grained.

John
Post by Sean Leonard
I am working on a Markdown effort in the Internet Engineering Task
Force, to standardize on "text/markdown" as the Internet media type
for all variations of Markdown content. You can read my draft here: <http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.
The proposal is already getting traction. Is there anyone on this list
that is interested in participating or helping this effort? In
particular we need to better understand and document what versions of
Markdown exist, so that either Markdown as a family of informal
syntaxes will start to converge, or if not, that Markdown variations
have an easy way to be distinguished from one another. (See the
"flavor" parameter discussed in the draft.)
Kind regards,
Sean Leonard
Author of Markdown IETF Draft
Waylan Limberg
2014-07-10 15:49:54 UTC
Permalink
On Jul 09, 2014, at 11:49 AM, Sean Leonard <dev+***@seantek.com> wrote:

Hi markdown-discuss Folks:

I am working on a Markdown effort in the Internet Engineering Task
Force, to standardize on "text/markdown" as the Internet media type for
all variations of Markdown content. You can read my draft here:
<http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.

My response below is lengthy but covers a number of different points including some raised later in the discussion by others. 

Sean, have you reached out to Mr. Gruber specifically? I mention this because in the past I have CCed him directly on a response I sent to this list which prompted him to respond (admittedly that happened some years ago). I suspect he might be amenable to the general idea though. A search of the list archives turned up a previous discussion [1] where he indicated a willingness to put in some work to obtain a mime type for markdown. Of course, that was back when he was still actively involved. Your mileage may vary.

[1]: http://article.gmane.org/gmane.text.markdown.general/1179

In any event, I have some thoughts about your proposal. I like it for the most part. But a few comments on some specifics:

Why do we need a Mime Type?
----------------------------------------

First of all, when is this necessary? In other words, when is plain markdown being sent around such that it needs a mime type? In my experience, REST APIs (for example) use JSON or XML which may contain some Markdown text among other data. That other data may identify that the text is "markdown", but the mime type for the file is JSON or XML (or at least the appropriate mime type for that file type). Or are you proposing that everyone standardize on a way to identify the markdown text within JSON and XML documents as Markdown text? What am I missing here?

Encodings
--------------

To shed a different light on the encoding issue, consider Python-Markdown (disclosure: I'm the primary developer). Just as in Python 3 (where all strings are Unicode), Python-Markdown only works with Unicode. You pass Unicode text in, and you get Unicode text out. It is up to the user of the library to concern themselves with encoding and decoding a file to/from a specific encoding. As Python provides the libraries to do that, it is not a big problem -- although for those used to working with byte strings it may be a little jarring (I'm seeing that reaction from people who are experimenting with Apple's new Swift Language -- which also supports Unicode only strings).

The point is, the Python-Markdown implementation has no use for the encoding (except for the included wrapping commandline script). Of course, the user (user of the library) will care about that and will need some way to identify the encoding before converting and passing the input on to the Python-Markdown library. So yes, encoding is very much a real, needed piece of meta-data.

However, if the markdown text is included in a JSON file (see my previous point above), then wouldn't the encoding be defined for the JSON file, not the markdown text specifically? The JSON parsing library would just spit out a Unicode string -- in which case, why do we need this?
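
To make that division of labor concrete, here is a minimal sketch using
Python-Markdown's documented `markdown.markdown()` entry point (the
encoding value stands in for whatever metadata the caller has):

```
import markdown  # Python-Markdown

def render(raw_bytes, encoding="utf-8"):
    # Decoding is the caller's job; the library itself only sees Unicode.
    text = raw_bytes.decode(encoding)
    html = markdown.markdown(text)  # Unicode in, Unicode out
    # Re-encode only if a byte stream is needed downstream.
    return html.encode("utf-8")
```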

Flavors
---------

To me, "flavors" seems like a disaster waiting to happen. Sean, I realize you have specifically stated a lack of understanding here, so lets go back in time. The following may not be an all-inclusive (or in proper order of events) history of Markdown, but provides enough (I hope) to make a point.

Way back when, the "flavor" of markdown you used depended almost entirely on which language (Perl, PHP, Python...) you were using to code your project (blog, wiki, CMS, etc.). If you where using PHP, them your flavor was PHP Markdown... There was only one implementation per language and they (mostly?) agreed with each other. In that day "flavor" was completely pointless. I suspect a number of us resistant to the "flavors" part of your proposal are from that period in Markdown's history.

Of course, then Ruby came along. I don't remember which library was which, but when the first library came out, it was not very good (lots of bugs and slow). Then a second library came out which also wasn't very good, but in different ways (except for the slow part). Some people wrote their markdown documents with the bugs of the first implementation in mind, while others wrote their documents with the second in mind. Then a few projects started offering users the option to pick which Ruby implementation of Markdown to use for each individual document - and "flavors" were born. Then other people started making ports of those projects to other languages and the "flavors" followed -- even though the other languages didn't really have any choices. As a reminder, Github came out of that Ruby culture, which might explain why Github-Flavored-Markdown ever existed in the first place (interesting side note: Gruber appears to like GFM [2] -- or at least the original release -- it has grown to include more features since then).

[2]: http://daringfireball.net/linked/2009/10/23/github-flavored-markdown

Then someone wrote a PEG grammar for Markdown. Once the hard work was done, a few people ported that grammar to other languages. And then a few people wrote C implementations (one of which used a PEG Grammar IIRC). Then, people wrote wrappers around the C libraries for any number of scripting languages (Perl, PHP, Python, Ruby...) and now there are a multitude of choices regardless of which language your project is coded in. Some time ago I started an incomplete [list] -- incomplete because those are the implementations I am aware of -- I'm sure there are some others.

[list]: https://github.com/markdown/markdown.github.com/wiki/Implementations

But for those of us that remember the pre-Ruby days there is only "one true implementation" per language and all the rest is just a bunch of noise (Okay, perhaps I exaggerate a bit -- just trying to make a point). For us "flavors" means something else entirely. Because before all this Ruby and C mess, we also had Multimarkdown and PHP Markdown Extra, more-or-less extending the same basic markdown syntax. Of course, those extensions are not identical, but given that each was implemented in a different language, it didn't matter. The "flavor" depended on which language your project was implemented in and that was it.

Of course, many of the extensions created in Multimarkdown and PHP Markdown Extra were then ported to other implementations in other languages. Consider Python-Markdown for instance. Python-Markdown provides an extension API so that any user of the library can write an extension which modifies the syntax in any way they wish -- to the point that it may not be Markdown any more. And a number of extensions ship with the Python-Markdown library [3]. Of those (at current count) 17 extensions, 7 of them also come under the umbrella of an 8th -- Extra. In other words, each individual feature of PHP Markdown Extra was implemented as its own extension, then when we had all of them, a wrapping extension (called "extra") was created as a shortcut. Some users use "extra", but others only use "footnotes" (for example). Any number of "flavors" are possible with the various combinations of extensions that ship with just this one library. And many of those extensions also accept user defined configuration settings which alters that extension's behavior (see footnotes [4] for an example). Then, there is a fairly extensive list of third party extensions [5] (which is always changing). I don't imagine that there is any sensible way to define all those possibilities in a way that is also understandable by other markdown implementations.

[3]: https://pythonhosted.org/Markdown/extensions/index.html
[4]: https://pythonhosted.org/Markdown/extensions/footnotes.html
[5]: https://github.com/waylan/Python-Markdown/wiki/Third-Party-Extensions
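
As a concrete illustration of how many "flavors" fall out of extension
choices alone, here is a short sketch with Python-Markdown (the input
text is just an example; the extension names are the documented ones):

```
import markdown

text = "A paragraph with a footnote.[^1]\n\n[^1]: The footnote text."

# Each combination of extensions is, in effect, its own flavor.
plain = markdown.markdown(text)                            # core syntax only
notes = markdown.markdown(text, extensions=["footnotes"])  # one extension
extra = markdown.markdown(text, extensions=["extra"])      # the umbrella set
```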

The great thing about Markdown is that any (decent) parser will simply pass over markup it doesn't understand. The text will just get passed through as (mostly) plain text. Given that one of the guiding principles behind Markdown is that it is human readable, if a particular implementation does not support a certain extension, the reader of the output could still understand the intended meaning and formatting (or at least "view source" as others have mentioned). Of course, this depends on a number of factors (overridden tokens, HTML's whitespace collapsing considerations, etc.). There are certainly many examples for which that does not hold true. But overall, I don't see that as a large concern.

So, the point (finally) is that "flavors" seem like an impossible-to-get-right part of your proposal and really won't matter in the real world. For example, if you send me some markdown text with a flavor of "markdown.pl", but I'm using Awk as my programming language, then I'm not going to use markdown.pl anyway. Or, if you send me a flavor of "extra", Awk doesn't have an implementation that supports "extra" (AFAIK), so that is useless to me as well. On the other hand, if I'm using Python, I can account for "extra" easily. Or for "markdown.pl" (just turn off smart_emphasis [6]). But "multimarkdown" is a different matter (I'm not exactly sure which features are supported by Multimarkdown or whether Python-Markdown's extensions implement them in the same way). And then there's "gfm" and "pandoc" and ... so many variations to account for. I think I'll just ignore this flavor stuff and use the implementation of *my* choice, which may or may not support the flavor sent my way.

[6]: https://pythonhosted.org/Markdown/reference.html#smart_emphasis
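
In code, that "use what I can, ignore the rest" strategy is about as
simple as it sounds -- a hypothetical sketch (the mapping below is
illustrative only, not any kind of registry):

```
import markdown

# Map the few flavor values this implementation chooses to recognize onto
# Python-Markdown extension sets; anything unknown just gets the default.
FLAVOR_EXTENSIONS = {
    "extra": ["extra"],
    "gfm": ["fenced_code"],   # a rough approximation at best
}

def render(text, flavor=None):
    exts = FLAVOR_EXTENSIONS.get(flavor, [])
    return markdown.markdown(text, extensions=exts)
```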

I hope that helps.

Waylan Limberg
Alan Hogan
2014-07-10 20:55:34 UTC
Permalink
Post by Waylan Limberg
In my experience, REST API's (for example) use JSON or XML which may contain some Markdown text among other data. That other data may identify that the text is "markdown", but the mime type for the file is JSON or XML (or at least the appropriate mime type for that file type).
I’m no [REST Police][1], but in my understanding, REST encourages the embracing of HTTP verbs to perform actions on hypertext objects. Consider an API call that updates the Markdown source of a blog post, for example. You are entirely correct that there is a strong chance that this API call would actually send an updated copy of a JSON object including fields such as “title”, “date”, “url”, and “body”, the last of which may implicitly or explicitly be Markdown data. (And the MIME type on that call would be application/json or whatever.) But perhaps the most RESTful way to do this would be to send a plain Markdown file (as text/markdown). (As far as the metadata goes, the server could extract the title from the markdown document itself (first level-one heading or first line of text, for example), set the date automatically, and so on.)

That’s not to say that a JSON API for updating a blog post isn’t RESTful, but rather that the non-JSON, pure-Markdown API is where the new MIME type would be most needed.

[1]: https://twitter.com/RESTPOLICE
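
A hypothetical sketch of that pure-Markdown variant (the endpoint URL is
invented for illustration, and it assumes the Python `requests` library):

```
import requests

# Read the Markdown source and PUT it as-is, labelled text/markdown; the
# server can derive the title/date from the document and its own clock.
with open("post.md", encoding="utf-8") as f:
    body = f.read()

resp = requests.put(
    "https://blog.example/posts/hello-world",   # hypothetical endpoint
    data=body.encode("utf-8"),
    headers={"Content-Type": "text/markdown; charset=UTF-8"},
)
resp.raise_for_status()
```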

Alan
Sean Leonard
2014-07-11 09:12:19 UTC
Permalink
Post by Waylan Limberg
Post by Sean Leonard
I am working on a Markdown effort in the Internet Engineering Task
Force, to standardize on "text/markdown" as the Internet media type for
<http://tools.ietf.org/html/draft-seantek-text-markdown-media-type-00>.
My response below is lengthy but covers a number of different points
including some raised later in the discussion by others.
Sean, have you reached out to Mr. Gruber specifically?
Yes, I e-mailed him...twice. We shall see if he responds.
Post by Waylan Limberg
Why do we need a Mime Type?
In my own selfish use case, I want to identify Markdown files in my
software projects as text/markdown (or *something*)...not just text/plain.

As I think I said before, Markdown is now being stored and exchanged
"as-is". It is those "as-is" cases where we want to identify Markdown as
it is, rather than its incorrect approximations (text/plain) or solely
in its output formats (text/html, etc.).
Post by Waylan Limberg
Encodings
--------------
[...]
However, if the markdown text is included in a JSON file (see my
previous point above), then wouldn't the encoding be defined for the
JSON file, not the markdown text specifically. The JSON parsing
library would just spit out a Unicode string -- in which case, why do
we need this?
It's a general property of text streams (and in particular, a general
property of Internet media types under the text/ main type) that the
streams have a designated encoding. Frequently, this encoding is
implicit, much like you can assume that Windows software uses
little-endian byte ordering, or that line breaks on *nix-derived
operating systems are <LF>, not <CRLF> or <CR>. The issue comes down to
interchange. When a Windows machine exchanges text files or data with a
Unix machine, how do you represent the concept of newlines in a common way?

To your specific point: if you encase Markdown content in JSON content
as a JSON string, you can probably assume the Markdown content uses the
Unicode character set. But you can have Markdown content as-is in its
own file or other protocol element, where there are no such assumptions.
For example, in your JSON example, the JSON content is likely in a file
(or XMLHttpRequest) that needs to be explicitly or implicitly labeled
with UTF-8 encoding.
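
To illustrate what an explicit label buys the consumer, here is a minimal
sketch (the header value is just an example) of honoring a declared
charset instead of guessing:

```
import cgi

content_type = "text/markdown; charset=ISO-8859-1"  # e.g. from a header
raw = b"Caf\xe9 au *lait*"                          # payload as received

_, params = cgi.parse_header(content_type)
text = raw.decode(params.get("charset", "utf-8"))   # no guessing required
```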
Post by Waylan Limberg
Flavors
---------
[...]
Now that I know more, I am seriously rethinking that part of the
proposal. Thanks for the info!

-Sean
Aristotle Pagaltzis
2014-07-13 03:13:26 UTC
Permalink
Post by Alan Hogan
You are entirely correct that there is a strong chance that this API
call would actually send an updated copy of a JSON object including
fields such as “title”, “date”, “url”, and “body”, the last of which
may implicitly or explicitly be Markdown data. (And the MIME type on
that call would be application/json or whatever.) But perhaps the most
RESTful way to do this would be to send a plain Markdown file (as
text/markdown).
Perhaps, yes, but not actually.

As far as REST is concerned, either approach is equally valid.

However, it is nicer if you can use a widespread MIME type that is more
specific than something ultra-generic like application/json, since these
generic MIME types tell you essentially nothing about the application-
level meaning of the data, which weakens the utility of intermediaries.
(I.e. a reverse proxy in front of your app might try to do clever stuff
based on the MIME type of a request; if your data is overly generically
labelled then the proxy must parse the response body to figure out what
type of data it is dealing with. Conversely for the same reason you also
don’t want to invent ultra-specific one-off MIME types, because existing
infrastructure will have no idea what type of thing that might be.)

But it is totally feasible for a few standard rules to be applied by the
server to extract metadata from the content of a Markdown document.

That is in fact exactly what my own hack for serving a directory full’a
Markdown files as a static site does. Furthermore,
Post by Alan Hogan
http://johnmacfarlane.net/babelmark2/?text=%3Chead%3E%3Ctitle%3EHello+World%3C%2Ftitle%3E%3Cmeta+name%3D%22author%22+content%3D%22Alice%22%3E%3C%2Fhead%3E%0A%0AI+am+some+text.%0A%3Cdiv%3Eand+i+am+inside+*myself*%3C%2Fdiv%3E%0A%0AThe+end.
And not surprisingly, the results are all over the place. Clearly this
is not an effective way to communicate HTML metadata, since Markdown
is designed to process HTML block-level content.
… I use a hacked Markdown processor that treats head-level elements just
like block-level elements (I find it a missed opportunity that at least
this much is not part of standard Markdown), then I run an HTML5 parser
over the output to normalise it, and finally I use an XSL transform
against the DOM from that to pull any remaining head elements up into
the head, before re-serialising the whole shebang.

(The H1-as-title extraction is only a fallback. So I can give documents
an explicit title different from their first heading, or even provide
a title when there are no headings present.)
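
A hypothetical sketch of that kind of title fallback (illustrative only,
not the actual code described here):

```
import re

def extract_title(md_text):
    # Prefer the first ATX level-one heading, e.g. "# My Title"
    # (ignoring code blocks and setext headings for brevity).
    m = re.search(r"^#[ \t]+(.+?)[ \t#]*$", md_text, re.MULTILINE)
    if m:
        return m.group(1)
    # Otherwise fall back to the first non-blank line of the document.
    for line in md_text.splitlines():
        if line.strip():
            return line.strip()
    return None
```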

(I have designs for releasing this thing someday but its current form is
cobbled together too hackily to work for anyone else.)
Post by Alan Hogan
I should also point out that a number of projects will use the first
<h1> Header in the document as the title. And if the file is stored on
the file system, the creation and modification date may be pulled from
the file system.
a) Yup, exactly.
Post by Alan Hogan
Why do we need a Mime Type?
b) I find the need for a MIME type trivially evident because I already
have directories full of files with nothing but Markdown in them.


Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>