[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread Ezio Melotti
Ezio Melotti added the comment:
> 16ed15ff0d7c was not in current stable py3.2 so I missed it..
It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently you might have missed it).
> When the comma is now raised as attribute name, then the problem is
> anyway moved to t

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread kxroberto
kxroberto added the comment: 16ed15ff0d7c was not in the current stable py3.2, so I missed it. Now that the comma is reported as an attribute name, the problem is moved to the higher level anyway, and can easily be handled there by the usual methods. (Still, I guess locatestarttagend_tolera

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread Ezio Melotti
Ezio Melotti added the comment: Note that the regex and the way the parser considers the commas changed in 16ed15ff0d7c (it now considers them as the name of a value-less attribute), so adding a group for the comma is no longer doable. In theory, the approach you suggest might work, but if we
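
The comma handling described above can be observed with a small collector. This is a sketch that assumes a current Python 3, where `HTMLParser` is always tolerant and no longer takes a `strict` argument; `AttrCollector` is a hypothetical helper name, and the exact attribute list may differ between Python versions.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the tag name and attribute list of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


collector = AttrCollector()
# The stray comma after the first attribute is the case under discussion.
collector.feed('<a href="x", class="y">link</a>')
collector.close()

tag, attrs = collector.seen[0]
# The comma should show up as an attribute name whose value is None.
print(tag, attrs)
```

Because the comma arrives as an ordinary entry in `attrs`, code built on top of the parser can filter or report it without touching the regexes.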

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread kxroberto
kxroberto added the comment: The old patch already warned about the majority of real cases, except for missing whitespace between attributes. "The tolerant regex will match both": locatestarttagend_tolerant: The main and frequent issue on the web here is the missing whitespace between attribut

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread Ezio Melotti
Ezio Melotti added the comment: The strict/tolerant mode mainly works by using either a strict or a tolerant regex. If the markup is invalid, the strict regex doesn't match and it gives an error. The tolerant regex will match both valid and invalid markup at the same time, without distincti
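
The strict/tolerant split described here can be sketched with two regexes. This is a minimal illustration, not the patterns `html.parser` actually uses: `ATTR`, `STRICT_TAG`, `TOLERANT_TAG` and `classify()` are hypothetical names, and the patterns only cover double-quoted attribute values. The tolerant pattern additionally accepts an attribute that starts directly after a closing quote, which is the missing-whitespace case raised in this thread.

```python
import re

# Hypothetical patterns for illustration; NOT the regexes used by html.parser.
# An attribute is a name, optionally followed by ="double-quoted value".
ATTR = r'[\w-]+(?:\s*=\s*"[^"]*")?'

# Strict: every attribute must be preceded by whitespace.
STRICT_TAG = re.compile(r'<[a-zA-Z][\w-]*(?:\s+' + ATTR + r')*\s*/?>')

# Tolerant: an attribute may also start right after a closing quote.
TOLERANT_TAG = re.compile(
    r'<[a-zA-Z][\w-]*(?:(?:\s+|(?<="))' + ATTR + r')*\s*/?>'
)


def classify(tag_text):
    """Return 'valid', 'tolerated' or 'rejected' for one start tag."""
    if STRICT_TAG.fullmatch(tag_text):
        return "valid"
    if TOLERANT_TAG.fullmatch(tag_text):
        return "tolerated"
    return "rejected"


print(classify('<a href="x" class="y">'))  # whitespace between attributes
print(classify('<a href="x"class="y">'))   # whitespace missing
```

Because the tolerant pattern also matches valid markup, this sketch can only tell the two cases apart by trying the strict pattern first.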

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread kxroberto
kxroberto added the comment: Well, many browsers for example have an internal warning and error log (window), which does not (need to) claim to be an official W3C checker. It has a positive effect on web stabilization. For example just looking now I see the many HTML and CSS warnings an

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-15 Thread Ezio Melotti
Ezio Melotti added the comment: The HTMLParser is not suitable for validation; even the strict mode allows some invalid markup (and it might be removed soon). Also I don't think it's easy to call a self.warnings() without trying the strict mode first. The tolerant parsing just allows more th

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-15 Thread kxroberto
kxroberto added the comment: I looked at the new patch http://hg.python.org/lookup/r86952 for Py3 (regarding the extended tolerance and local backporting to Python2.7): What I miss are calls to some kind of self.warning(msg,i,k) function in non-strict/tolerant mode (where self.error is calle
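
One way to approximate the requested warning hook without changing the parser is to add it in a subclass. This is only a sketch under stated assumptions: `WarningParser`, its `warning()` method and the "name does not start with a letter" heuristic are all hypothetical, and `HTMLParser` itself provides no such hook. The signature is also simplified to take just a message, with the position taken from `getpos()`.

```python
from html.parser import HTMLParser


class WarningParser(HTMLParser):
    """Tolerant parser that records oddities instead of raising."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def warning(self, msg):
        # Hypothetical hook: HTMLParser itself has no warning() method.
        # getpos() returns the current (line, column) position.
        self.warnings.append((self.getpos(), msg))

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            # Heuristic assumption: a name that does not start with a
            # letter is probably stray punctuation such as a comma.
            if not name[:1].isalpha():
                self.warning(
                    "suspicious attribute name %r in <%s>" % (name, tag)
                )


parser = WarningParser()
parser.feed('<a href="x", class="y">link</a>')
parser.close()

for position, message in parser.warnings:
    print(position, message)
```

This only sees what the parser already accepted, so it cannot report problems the tolerant regexes silently absorb, such as missing whitespace between attributes.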

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-12-03 Thread R. David Murray
R. David Murray added the comment: A note for the curious: I changed the keyword name from 'tolerant' to 'strict' because the stdlib has other examples of 'strict' as a keyword, but the word 'tolerant' appears nowhere in the documentation and certainly not as a keyword. So it seemed better t

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-12-02 Thread R. David Murray
R. David Murray added the comment: I have committed a version of this patch, without the warnings, using the keyword 'strict=True' as the default, and with a couple added heuristics from other similar issues, in r86952. kxroberto, if you want to supply your full name, I'll add you to Misc/ACK

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-11-20 Thread Neil Muller
Neil Muller added the comment: #975556 and #1046092 look like they should also be superseded by this. -- nosy: +Neil Muller

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-09-04 Thread R. David Murray
R. David Murray added the comment: See also issue 1058305, which may be a duplicate.

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-27 Thread R. David Murray
R. David Murray added the comment: For anyone who does want to work on this (and I do, but it will be quite a while before I can), see also issue 6191.

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-26 Thread kxroberto
kxroberto added the comment: I'm not working with Py3 and don't know how much that module differs in 3. Unless it's going into a py2 version, I'll leave the FR to the py3 community for now.

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread Terry J. Reedy
Terry J. Reedy added the comment: I agree that a tolerant mode would be good (and often requested). String encoding and decoding also have strict and forgiving modes, so this seems close to a policy. Unit tests with example snippets that properly fail strict mode and pass the new tolerant mo
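
The codec precedent mentioned here looks like this in practice. The sample bytes are an arbitrary illustration; `errors="strict"` is the default for `bytes.decode`, and `errors="replace"` is one of the forgiving handlers.

```python
data = b"caf\xe9"  # Latin-1 encoded bytes, not valid UTF-8

# Strict mode (the default) refuses the malformed input.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Forgiving mode substitutes U+FFFD for the undecodable byte.
forgiving = data.decode("utf-8", errors="replace")

print(strict_ok, repr(forgiving))
```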

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread R. David Murray
R. David Murray added the comment: 2.6 is now in security-fix-only mode. Since this is a new feature, it can only go into 3.2. Can you provide a patch against py3k trunk? I've only glanced at the patch briefly, but one thing that concerns me is 'warning file'. I suppose that either the log

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread kxroberto
Changes by kxroberto: -- versions: +Python 2.6, Python 2.7 Added file: http://bugs.python.org/file18624/test_htmlparser_tolerant.patch

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread kxroberto
kxroberto added the comment: For me, a parser which cannot be fed HTML from outside (which I cannot edit myself) is not of much use at all. Attached is my current patch (vs. py26), with many changes in the meantime, and a test case. I've set the default to strict mode, but ... -- Added file: ht

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-22 Thread R. David Murray
R. David Murray added the comment: I disagree (and might disagree with those other closings but I haven't noticed them I guess). BeautifulSoup does *not* cover this ground; it is broken in 3.x because of the lack of a tolerant HTML parser in the stdlib (it used to use sgmllib, which is now go

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-22 Thread Mark Lawrence
Mark Lawrence added the comment: I think this should be closed, as have other similar requests in the last few days. -- nosy: +BreamoreBoy, fdrake

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-08 Thread Terry J. Reedy
Terry J. Reedy added the comment: This needs to be checked for applicability to 3.x. Do beautifulsoup and other programs cover this ground (tolerant parsing of junk HTML)? -- nosy: +terry.reedy versions: +Python 3.2 -Python 2.7, Python 3.1

[issue1486713] HTMLParser : A auto-tolerant parsing mode

2009-03-20 Thread Daniel Diniz
Changes by Daniel Diniz: -- stage: -> test needed type: -> feature request versions: +Python 2.7, Python 3.1 -Python 2.4