[ http://jira.codehaus.org/browse/DOXIA-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=132872#action_132872 ]
Benjamin Bentmann commented on DOXIA-236:
-----------------------------------------

bq. An anchor in HTML should be a valid HTML ID token, so IMO we could apply the same rules to a general sink.

Given HTML's popularity, I generally agree with adopting its rules for the Sink API to keep the learning curve low.

bq. See the javadoc for HtmlTools.encodeId().

My question is just which side of the game (parser or sink) is responsible for doing this conversion. For instance, looking at {{AptParser}} and {{XhtmlBaseSink}}, we currently seem to allow parsers to call {{Sink.anchor()}} with an arbitrary string and require the sinks to normalize this string according to their needs/restrictions. I am fine with this approach; all I wanted is one or two lines in the javadoc of {{Sink.anchor()}} that explicitly express this freedom for parsers, so that sink implementors are aware of their job to handle arbitrary inputs and don't erroneously assume something like an HTML ID token coming in.

Side note: The javadoc for the method {{HtmlTools.encodeId()}} mentions the pattern {{[A-Za-z][A-Za-z0-9:_.-]*}} for its output. To me, this looks like the term "letter" is meant to refer to ASCII characters in this context. However, the employed method {{Character.isLetter()}} classifies characters according to the Unicode data file. For instance, the characters "ä" and "ß" are letters in the Unicode sense. {{encodeId()}} will pass these through to its output, violating the ASCII-only pattern stated in its javadoc.

bq. internal links should start with "#", again a la html.

In other words:
- {{AptParser}} must convert "foo.pdf" into "#foo.pdf" (assuming its current link interpretation) before passing the link to the sink
- {{XhtmlBaseSink}} must neither check for protocols nor for "./" nor for ".html#" in the link string but only for a leading sharp ("#")

Right?

bq. No pretty printing, no modification whatsoever.

Certain input formats (e.g. XML-based) allow for ignorable/collapsible whitespace.
Removing/normalizing this should be the responsibility of the parser such that the sink only sees real content. I.e., pretty printing from the input document should be removed by the parser.

Also worth clarifying: line terminators. Require sinks to handle all possible variants, or require parsers to normalize these to "\n" as done in the XML spec?

bq. IMO a figureCaption can come before or after figureGraphics

That's right, and the same goes for the table caption. Again, if this freedom is allowed for parsers, it should be documented somewhere to let sink implementors know that they must be prepared to handle any possible sequence.

bq. Can you elaborate why this is relevant?

Because an API is a contract and there should be no guessing, no assumptions, no grey zones; the responsibilities of all parties should be clear. Imagine somebody wanted to implement a sink. When it comes to the {{close()}} method, may he throw an {{IllegalStateException}} on the second call? The API neither allows nor prohibits this, so there is room for guessing, room for different implementations and room for breaking the interchangeability of sinks. I never wanted to come up with a new Doxia; all I am looking for is a specification that allows people to clearly identify which component (parser/sink) is misbehaving in case of a wrong output document.

bq. IMO closing a sink a second time should just do nothing, as it basically just closes the underlying Writer.

+1, but again, this should simply be explicitly written down for everybody as a reference.

> Clarify Sink API
> ----------------
>
>                 Key: DOXIA-236
>                 URL: http://jira.codehaus.org/browse/DOXIA-236
>             Project: Maven Doxia
>          Issue Type: Task
>          Components: Sink API
>    Affects Versions: 1.0-alpha-2
>            Reporter: Benjamin Bentmann
>
> If the idea with extensibility and interchangeable input/output formats
> should be more than a nice dream, the Sink API needs a thorough specification
> (e.g.
> by means of more javadoc at {{Sink}}) because that's where everything
> meets. It should define
> # what rules parsers must obey when generating events and
> # what events a sink needs to be prepared to handle
> Currently, all of this is left to assumptions. Some example issues that need
> to be clarified:
> - What characters may constitute an anchor reported by {{anchor()}}?
> Arbitrary, ASCII-only, ...?
> - What format applies to the {{name}} parameter of {{link()}}? How are
> internal and external links to be distinguished (DOXIA-208)?
> - What character chunks are reported by {{text()}}? Longest consecutive
> sequence, line-by-line, arbitrary, ... (DOXIA-222)?
> - What exactly is a figure's source as reported by {{figureGraphics()}}?
> Relative/absolute path, relative to which directory? What about file
> extensions (DOXIA-99)?
> - What order of events is "reasonable" (DOXIA-132)? May parsers report table
> body and caption in a specific or arbitrary order? Must the document head
> always be reported before body or may it be postponed?
> - Is closing a sink twice acceptable or an error?
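
For reference, two of the points raised in the comment (the {{Character.isLetter()}} side note and a {{close()}} that does nothing on the second call) can be illustrated with a small Java sketch. This is only a sketch under stated assumptions: the names {{SinkContractSketch}}, {{isAsciiLetter}} and {{IdempotentSink}} are hypothetical and are not part of Doxia, and the sink class is a stand-in, not an implementation of the real {{Sink}} interface.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical illustration only; none of these names exist in Doxia.
class SinkContractSketch {

    /** ASCII-only letter test, matching the [A-Za-z] part of the documented pattern. */
    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    /** Stand-in for a sink whose close() tolerates repeated calls. */
    static class IdempotentSink {
        private final PrintWriter writer;
        private boolean closed;

        IdempotentSink(PrintWriter writer) {
            this.writer = writer;
        }

        void text(String s) {
            writer.write(s);
        }

        void close() {
            if (closed) {
                return; // second and later calls do nothing
            }
            closed = true;
            writer.close();
        }
    }

    public static void main(String[] args) {
        char umlaut = '\u00E4'; // "ä"

        // Unicode-aware check: "ä" counts as a letter.
        System.out.println(Character.isLetter(umlaut)); // true
        // ASCII-only check: "ä" is rejected.
        System.out.println(isAsciiLetter(umlaut));      // false

        StringWriter out = new StringWriter();
        IdempotentSink sink = new IdempotentSink(new PrintWriter(out));
        sink.text("content");
        sink.close();
        sink.close(); // no exception, no effect
        System.out.println(out);                        // content
    }
}
```

Whether a real sink should behave like {{IdempotentSink}} is exactly the kind of question the issue asks the {{Sink}} javadoc to answer; the sketch only shows one possible reading.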