[issue30011] HTMLParser class is not thread safe
New submission from Alessandro Vesely:

SYMPTOM: When used in a multithreaded program, instances of a class derived from HTMLParser may convert an entity or leave it alone, in an apparently random fashion.

CAUSE: The class has a static attribute, entitydefs, which, on first use, is initialized from None to a dictionary of entity definitions. Initialization is not atomic. Therefore, instances in concurrent threads assume that initialization is complete and catch a KeyError if the entity at hand hasn't been set yet. In that case, the entity is left alone as if it were invalid.

WORKAROUND:

    class Dummy(HTMLParser):
        """this class is defined here so that we can initialize its base class"""
        def __init__(self):
            HTMLParser.__init__(self)

    # Initialize HTMLParser by loading htmlentitydefs
    dummy = Dummy()
    dummy.feed('')
    del dummy, Dummy

--
components: Library (Lib)
messages: 291256
nosy: ale2017
priority: normal
severity: normal
status: open
title: HTMLParser class is not thread safe
type: behavior
versions: Python 2.7

___
Python tracker <http://bugs.python.org/issue30011>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue30011] HTMLParser class is not thread safe
Alessandro Vesely added the comment:

On Fri 14/Apr/2017 19:44:29 +0200 Serhiy Storchaka wrote:
>
> Changes by Serhiy Storchaka:
>
>
> --
> pull_requests: +1272

Thank you for your fix, Serhiy. It makes the class behave consistently. However, busy processes are going to concurrently build multiple temporary entitydefs objects before one of them wins, which is probably worse than the eager initialization at startup that such lazy initialization tries to avoid in the first place. Doesn't that design deserve a comment in the code, at least?

Greetings
Ale
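The trade-off described above can be illustrated with a minimal sketch. It assumes the fix follows the common "build locally, then publish with one assignment" pattern. The class `LazyEntities` and its `lookup()` method are hypothetical stand-ins, not the actual HTMLParser code, and the example uses Python 3's `html.entities` rather than Python 2's `htmlentitydefs`.

```python
import threading
from html.entities import name2codepoint


class LazyEntities:
    """Hypothetical toy class with a lazily built, class-level dict."""

    entitydefs = None  # shared by all instances, filled on first use

    @classmethod
    def lookup(cls, name):
        if cls.entitydefs is None:
            # Build into a local dict first, so other threads never
            # observe a partially filled dictionary.
            defs = {"apos": "'"}
            for key, codepoint in name2codepoint.items():
                defs[key] = chr(codepoint)
            # Publish with a single assignment. Several threads may reach
            # this point and each build its own dict; the last assignment
            # wins, and all candidates have identical contents.
            cls.entitydefs = defs
        try:
            return cls.entitydefs[name]
        except KeyError:
            # Unknown entity: leave it alone.
            return "&" + name + ";"


def worker(results, index):
    results[index] = LazyEntities.lookup("eacute")


results = [None] * 8
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With this pattern every thread sees either None or a complete dictionary, at the cost of possibly building the dictionary more than once.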
[issue30011] HTMLParser class is not thread safe
Alessandro Vesely added the comment:

Serhiy's analysis is correct. If anything more than a comment is going to make its way into the code, I'd suggest moving dictionary building to its own function, so that it can be called either on first use, as it is now, or before threading if the user is concerned.

I agree there is nothing wrong with multiple builds. My point concerns only a minor, bearable inefficiency, which can be neglected. Its most annoying case is probably with test suites, which are more likely to start a bunch of new threads all at once.

Greetings
Ale
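A rough sketch of that suggestion follows, written for Python 3. The names `Parser`, `load_entitydefs()` and `unescape_name()` are hypothetical and are not part of HTMLParser. The point is only that the same function serves both the lazy path and an explicit call made before any thread is started.

```python
from html.entities import name2codepoint


class Parser:
    """Hypothetical parser-like class; not the real HTMLParser."""

    entitydefs = None

    @classmethod
    def load_entitydefs(cls):
        """Build the entity dictionary if needed and return it."""
        if cls.entitydefs is None:
            defs = {"apos": "'"}
            defs.update((name, chr(cp)) for name, cp in name2codepoint.items())
            cls.entitydefs = defs  # single assignment of a complete dict
        return cls.entitydefs

    def unescape_name(self, name):
        # Lazy path: the dictionary is loaded on first use.
        defs = self.load_entitydefs()
        return defs.get(name, "&" + name + ";")


# Eager path: a concerned user calls this once before starting threads,
# so no thread ever has to build the dictionary itself.
Parser.load_entitydefs()
```

Calling `Parser.load_entitydefs()` a second time returns the already built dictionary without rebuilding it.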
[issue29462] RFC822-comments in email header fields can fool, e.g., get_filename()
New submission from Alessandro Vesely:

Comments are allowed almost everywhere in an email message, and should be eliminated before attributing any meaning to a field. In the words of RFC5322, any CRLF that appears in FWS is semantically "invisible".

In particular, some note that comments can be used to deceive an email filter. For example, like so:

    Content-Disposition: attachment; filename=''attached%2E"; filename*1*="%62"; filename*2=(fool filters)at

(I don't know which, if any, email clients would execute that batch...)

Anyway, removing comments is needed for any structured header field. One is usually interested in the unfolded, de-commented value. It is difficult to do correctly, because of nesting and quoting possibilities.

This issue seems to be ignored, except for address lists (there is a getcomment() member in AddrlistClass). Why?

--
components: email
messages: 287119
nosy: ale2017, barry, r.david.murray
priority: normal
severity: normal
status: open
title: RFC822-comments in email header fields can fool, e.g., get_filename()
type: behavior
versions: Python 2.7

___
Python tracker <http://bugs.python.org/issue29462>
___
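To make the nesting and quoting difficulty concrete, here is a minimal sketch of comment removal for an already unfolded header value. The function name `strip_comments()` is hypothetical; it is not code from the tracker and not part of the email package. It assumes RFC 5322 style syntax: parentheses nest, a backslash escapes the next character inside comments and quoted strings, and parentheses inside a quoted string are literal.

```python
def strip_comments(value):
    """Remove (possibly nested) parenthesized comments from a header value."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a quoted-string (tracked only outside comments)
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and (in_quotes or depth):
            # Quoted-pair: the next character is literal. Keep the pair
            # only when it is outside a comment.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1          # open a new (possibly nested) comment
        elif depth > 0 and ch == ")":
            depth -= 1          # close the innermost comment
        elif depth == 0:
            out.append(ch)      # ordinary character outside any comment
        i += 1
    return "".join(out)


print(strip_comments("filename*2=(fool filters)at"))
print(strip_comments('name="keep (this)"'))
```

The sketch does not unfold the header, decode RFC 2231 parameters, or check whether comments are allowed at a given position; those steps would still be needed in real use.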
[issue29462] RFC822-comments in email header fields can fool, e.g., get_filename()
Alessandro Vesely added the comment:

I did not find CFWS in RFC 2231 either. In addition, RFC 2045 (Introduction) says that Content-Disposition, where filename is defined, cannot include comments. However, Content-Type can include RFC 822 comments, so the filename should be de-commented in case it is inferred from the name parameter there.

I'm rather new to Python, and sticking to version 2 because of the packages I work with. I see Python3's email has a much more robust design. Does this mean Python2 cannot get fixed?

I attach a de_comment() function, copied from the one I mentioned this morning. The rest of the file shows its intended use. (Oops, it removes comments even from where they are not supposed to be allowed ;-)

Having that kind of functionality in email.utils would make it easier to read Message objects, no?

--
Added file: http://bugs.python.org/file46551/attachments.py
[issue29462] RFC822-comments in email header fields can fool, e.g., get_filename()
Alessandro Vesely added the comment:

We can close this, then. Let's hope migration to Python3 isn't going to last forever...

Thank you for your cooperation

--
resolution: -> wont fix
stage: -> resolved
status: open -> closed