[ 
http://jira.codehaus.org/browse/DOXIA-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=132872#action_132872
 ] 

Benjamin Bentmann commented on DOXIA-236:
-----------------------------------------

bq. An anchor in HTML should be a valid HTML ID token, so IMO we could apply 
the same rules to a general sink.
Given HTML's popularity I generally agree in adopting its rules for the Sink 
API to keep the learning curve low.

bq. See the javadoc for HtmlTools.encodeId().
My question is just which side of the game (parser - sink) is responsible for 
doing this conversion? For instance, looking at {{AptParser}} and 
{{XhtmlBaseSink}}, we currently seem to allow parsers to call {{Sink.anchor()}} 
with an arbitrary string and require the sinks to normalize this string 
according to their needs/restrictions. I am fine with this approach; all I 
want is one or two lines in the javadoc of {{Sink.anchor()}} that explicitly 
express this freedom for parsers, so that sink implementors are aware of their 
job to handle arbitrary inputs and don't erroneously assume that something like 
an HTML ID token is coming in.
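
To make the request concrete, here is a rough sketch of what such javadoc and a 
conforming sink could look like. {{AnchorSinkSketch}} and 
{{HtmlLikeAnchorSink}} are hypothetical names invented for this example, not 
Doxia classes, and the normalization shown is only one possible choice.

```java
// Hypothetical stand-in for the relevant part of the Sink interface.
// The javadoc wording is a suggestion only, not existing Doxia text.
interface AnchorSinkSketch {
    /**
     * Starts an anchor.
     *
     * @param name the anchor name as found in the source document. Parsers may
     *             pass an arbitrary non-null string; each sink is responsible
     *             for converting it into whatever its output format permits.
     */
    void anchor(String name);
}

// Example sink that accepts any input and normalizes it for an HTML-like output.
final class HtmlLikeAnchorSink implements AnchorSinkSketch {
    final StringBuilder out = new StringBuilder();

    @Override
    public void anchor(String name) {
        // Replace every character outside the ASCII ID alphabet with '_'.
        String id = name.replaceAll("[^A-Za-z0-9:_.-]", "_");
        // An ID must start with an ASCII letter; prefix one if it does not.
        if (id.isEmpty() || !id.substring(0, 1).matches("[A-Za-z]")) {
            id = "a" + id;
        }
        out.append("<a name=\"").append(id).append("\"></a>");
    }
}
```

With this contract, a parser can hand over a string such as "my anchor!" 
unchanged, and it is the sink alone that decides how to make it fit the output 
format.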

Side note:
The javadoc for the method {{HtmlTools.encodeId()}} mentions the pattern 
{{[A-Za-z][A-Za-z0-9:_.-]*}} for its output. To me, this looks like the term 
"letter" is meant to refer to ASCII characters in this context. However, the 
employed method {{Character.isLetter()}} will classify characters according to 
the Unicode data file. For instance, the characters "ä" and "ß" are letters in 
the Unicode sense. {{encodeId()}} will pass these through to its output, 
violating the ASCII-only pattern stated in its javadoc.
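
The difference is easy to reproduce in isolation. The following sketch does not 
touch {{HtmlTools}} at all; {{LetterCheckDemo}} and its two helper methods are 
made up for illustration and simply compare {{Character.isLetter()}} with a 
plain ASCII range check.

```java
final class LetterCheckDemo {
    // Unicode-aware check: also true for letters outside ASCII such as U+00E4.
    static boolean isUnicodeLetter(char c) {
        return Character.isLetter(c);
    }

    // ASCII-only check, matching the [A-Za-z] part of the documented pattern.
    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    public static void main(String[] args) {
        // \u00e4 is "a with diaeresis", \u00df is "sharp s".
        char[] samples = {'a', 'Z', '\u00e4', '\u00df', '1'};
        for (char c : samples) {
            System.out.println("U+" + Integer.toHexString(c)
                    + " unicode=" + isUnicodeLetter(c)
                    + " ascii=" + isAsciiLetter(c));
        }
    }
}
```

For the two non-ASCII samples the checks disagree, which is exactly the gap 
between the documented pattern and the actual output.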

bq. internal links should start with "#", again a la html.
In other words:
- {{AptParser}} must convert "foo.pdf" into "#foo.pdf" (assuming its current 
link interpretation) before passing the link to the sink
- {{XhtmlBaseSink}} must check neither for protocols, nor for "./", nor for 
".html#" in the link string, but only for a leading "#"
Right?
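
If I read the rule correctly, the division of labour would look roughly like 
the sketch below. {{LinkRuleSketch}} and its methods are hypothetical and only 
illustrate the proposed convention; they do not reflect what {{AptParser}} or 
{{XhtmlBaseSink}} currently do.

```java
final class LinkRuleSketch {
    // Parser side: mark a target that the source format treats as internal
    // by prefixing it with '#' before handing it to the sink.
    static String internalLink(String anchorName) {
        return anchorName.startsWith("#") ? anchorName : "#" + anchorName;
    }

    // Sink side: under the proposed rule this is the only test required.
    // No inspection of protocols, "./" prefixes or ".html#" fragments.
    static boolean isInternal(String link) {
        return link.startsWith("#");
    }

    public static void main(String[] args) {
        System.out.println(internalLink("foo.pdf"));
        System.out.println(isInternal("#foo.pdf"));
        System.out.println(isInternal("./foo.html#bar"));
    }
}
```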

bq. No pretty printing, no modification whatsoever.
Certain input formats (e.g. XML-based) allow for ignorable/collapsible 
whitespace. Removing/normalizing this should be the responsibility of the 
parser such that the sink only sees real content. I.e. pretty printing from the 
input document should be removed by the parser.

Also worth clarifying: line terminators. Should sinks be required to handle all 
possible variants, or should parsers be required to normalize these to "\n" as 
done in the XML spec?
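
Should the parser side be chosen, the normalization itself is small. The sketch 
below assumes a hypothetical helper {{LineEndings.normalize()}} that a parser 
would call before emitting text; it maps "\r\n" and a lone "\r" to "\n", which 
is how I understand the XML end-of-line handling.

```java
final class LineEndings {
    // Converts Windows (\r\n) and old Mac (\r) line terminators to \n.
    // The \r\n pair is replaced first so it does not become two newlines.
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\rc\nd";
        // Count the newlines after normalization; no \r should remain.
        String normalized = normalize(mixed);
        System.out.println(normalized.indexOf('\r') < 0);
    }
}
```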

bq. IMO a figureCaption can come before or after figureGraphics
That's right, the same for the table caption. Again, if this freedom is allowed 
for parsers, it should be documented somewhere to let sink implementors know 
that they must be prepared to handle any possible sequence.
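
One way a sink could cope with either order is to buffer both pieces until the 
figure ends. This is only a sketch: {{FigureBuffer}} and its three methods are 
simplified, hypothetical stand-ins and do not mirror the real Sink method 
signatures.

```java
// Hypothetical, simplified figure handling: collects graphics and caption in
// whatever order they arrive and renders them only when the figure ends.
final class FigureBuffer {
    private String source;
    private String caption;

    void figureGraphics(String name) {
        source = name;
    }

    void figureCaption(String text) {
        caption = text;
    }

    // Renders the buffered figure and resets the state for the next one.
    String endFigure() {
        String html = "<img src=\"" + (source == null ? "" : source)
                + "\" alt=\"" + (caption == null ? "" : caption) + "\" />";
        source = null;
        caption = null;
        return html;
    }
}
```

Both event orders then yield the same output, so the sink no longer depends on 
what a particular parser happens to do.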

bq. Can you elaborate why this is relevant?
Because an API is a contract, and there should be no guessing, no assumptions, 
no grey zones; the responsibilities of all parties should be clear. Imagine 
somebody wants to implement a sink. When it comes to the {{close()}} method, 
may he throw an {{IllegalStateException}} on the second call? The API neither 
allows nor prohibits this, so there is room for guessing, room for different 
implementations, and room for breaking the interchangeability of sinks. I never 
wanted to come up with a new Doxia; all I am seeking is a specification that 
allows people to clearly identify which component (parser/sink) is misbehaving 
in case of a wrong output document.

bq. IMO closing a sink a second time should just do nothing, as it basically 
just closes the underlying Writer.
+1, but again, this should simply be explicitly written down for everybody as a 
reference.
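
For reference, the behaviour being agreed on here could be sketched as follows. 
{{CloseOnceSink}} is a hypothetical class written for this comment, not an 
existing Doxia sink; it merely shows a {{close()}} that is safe to call 
repeatedly.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sink fragment whose close() may be called any number of times.
final class CloseOnceSink {
    private final Writer writer;
    private boolean closed;

    CloseOnceSink(Writer writer) {
        this.writer = writer;
    }

    void close() {
        if (closed) {
            return; // second and later calls do nothing
        }
        closed = true;
        try {
            writer.close();
        } catch (IOException e) {
            // Rethrown unchecked to keep close() free of a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CloseOnceSink sink = new CloseOnceSink(new StringWriter());
        sink.close();
        sink.close();
        System.out.println("closed twice without error");
    }
}
```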


> Clarify Sink API
> ----------------
>
>                 Key: DOXIA-236
>                 URL: http://jira.codehaus.org/browse/DOXIA-236
>             Project: Maven Doxia
>          Issue Type: Task
>          Components: Sink API
>    Affects Versions: 1.0-alpha-2
>            Reporter: Benjamin Bentmann
>
> If the idea with extensibility and interchangeable input/output formats 
> should be more than a nice dream, the Sink API needs a thorough specification 
> (e.g. by means of more javadoc at {{Sink}}) because that's where everything 
> meets. It should define
> # what rules parsers must obey when generating events and
> # what events a sink needs to be prepared to handle
> Currently, all of this is left to assumptions. Some example issues that need 
> to be clarified:
> - What characters may constitute an anchor reported by {{anchor()}}? 
> Arbitrary, ASCII-only, ...?
> - What format applies to the {{name}} parameter of {{link()}}? How are 
> internal and external links to be distinguished (DOXIA-208)?
> - What character chunks are reported by {{text()}}? Longest consecutive 
> sequence, line-by-line, arbitrary, ... (DOXIA-222)?
> - What exactly is a figure's source as reported by {{figureGraphics()}}? 
> Relative/absolute path, relative to which directory? What about file 
> extensions (DOXIA-99)?
> - What order of events is "reasonable" (DOXIA-132)? May parsers report table 
> body and caption in a specific or arbitrary order? Must the document head 
> always be reported before body or may it be postponed? 
> - Is closing a sink twice acceptable or an error?


