[issue13703] Hash collision security issue
Changes by Glenn Linderman:

--
nosy: +v+python
[issue13703] Hash collision security issue
Glenn Linderman added the comment:

Given Martin's comment (msg150832) I guess I should add my suggestion to this issue, at least for the record.

Rather than change hash functions, randomization could be added to those dicts that are subject to attack because they store user-supplied key values. The list so far seems to be urllib.parse, cgi, json. Some have claimed there are many more, but without enumeration. These three are clearly related to the documented issue.

The technique would be to wrap dict and add a short random prefix to each key value, preventing the attacker from supplying keys that are known to collide... and even if he successfully stumbles on a set that does collide on one request, it is unlikely to collide on a subsequent request with a different prefix string.

The technique is fully backward compatible with all applications except those that contain potential vulnerabilities as described by the researchers. The technique adds no startup or runtime overhead to any application that doesn't contain the potential vulnerabilities.

Due to the per-request randomization, the complexity of creating a sequence of sets of keys that may collide is enormous, and requires that such a set of keys happen to arrive on a request in the right sequence where the predicted prefix randomization would be used to cause the collisions to occur. This might be possible on a lightly loaded system, but is less likely on a system with heavy load, which are more interesting to attack.

Serhiy Storchaka provided a sample implementation on python-dev, copied below, and attached as a file (but it is not a patch).

    # -*- coding: utf-8 -*-
    from collections import MutableMapping
    import random

    class SafeDict(dict, MutableMapping):
        def __init__(self, *args, **kwds):
            dict.__init__(self)
            self._prefix = str(random.getrandbits(64))
            self.update(*args, **kwds)
        def clear(self):
            dict.clear(self)
            self._prefix = str(random.getrandbits(64))
        def _safe_key(self, key):
            return self._prefix + repr(key), key
        def __getitem__(self, key):
            try:
                return dict.__getitem__(self, self._safe_key(key))
            except KeyError as e:
                e.args = (key,)
                raise e
        def __setitem__(self, key, value):
            dict.__setitem__(self, self._safe_key(key), value)
        def __delitem__(self, key):
            try:
                dict.__delitem__(self, self._safe_key(key))
            except KeyError as e:
                e.args = (key,)
                raise e
        def __iter__(self):
            for skey, key in dict.__iter__(self):
                yield key
        def __contains__(self, key):
            return dict.__contains__(self, self._safe_key(key))
        setdefault = MutableMapping.setdefault
        update = MutableMapping.update
        pop = MutableMapping.pop
        popitem = MutableMapping.popitem
        keys = MutableMapping.keys
        values = MutableMapping.values
        items = MutableMapping.items
        def __repr__(self):
            return '{%s}' % ', '.join('%s: %s' % (repr(k), repr(v))
                                      for k, v in self.items())
        def copy(self):
            return self.__class__(self)
        @classmethod
        def fromkeys(cls, iterable, value=None):
            d = cls()
            for key in iterable:
                d[key] = value
            return d
        def __eq__(self, other):
            return all(k in other and other[k] == v for k, v in self.items()) and \
                   all(k in self and self[k] == v for k, v in other.items())
        def __ne__(self, other):
            return not self == other

--
Added file: http://bugs.python.org/file24169/SafeDict.py
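For illustration only, a minimal usage sketch of the SafeDict class above; the form data here is made up, and the point is simply that attacker-chosen key names cannot target a predictable hash layout because of the per-dict random prefix:

    # Illustrative only: collect user-supplied form fields into a SafeDict.
    fields = SafeDict()
    for pair in "a=1&b=2&a=3".split("&"):
        name, _, value = pair.partition("=")
        fields.setdefault(name, []).append(value)
    print(sorted(fields.items()))   # [('a', ['1', '3']), ('b', ['2'])]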
[issue13703] Hash collision security issue
Glenn Linderman added the comment:

Alex, I agree the issue has to do with the origin of the data, but the modules listed are the ones that deal with the data supplied by this particular attack.

Note that changing the hash algorithm for a persistent process, even though each process may have a different seed or randomized source, allows attacks for the life of that process, if an attack vector can be created during its lifetime. This is not a problem for systems where each request is handled by a different process, but is a problem for systems where processes are long-running and handle many requests.

Regarding vulnerable user code, supplying SafeDict (or something similar) in the stdlib, or as sample code for use in such cases, allows user code to be fixed also.

You have entered the class of people that claim lots of vulnerabilities, without enumeration.
[issue13703] Hash collision security issue
Glenn Linderman added the comment:

[offlist] Paul, thanks for the enumeration and response. Some folks have more experience, but the rest of us need to learn. Having the proposal in the ticket, with an explanation of its deficiencies, is not all bad; others can learn, perhaps. On the other hand, I'm willing to learn more, if you are willing to address my concerns below. I had read the whole thread and issue, but it still seemed like a leap of faith to conclude that the only, or at least best, solution is changing the hash. Yet changing the hash still doesn't seem like a sufficient solution, due to long-lived processes.

On 1/7/2012 6:40 PM, Paul McMillan wrote:
> Paul McMillan added the comment:
>
>> Alex, I agree the issue has to do with the origin of the data, but the
>> modules listed are the ones that deal with the data supplied by this
>> particular attack.
> They deal directly with the data. Do any of them pass the data
> further, or does the data stop with them?

For web forms and requests, which is the claimed vulnerability, I would expect that most of them do not pass the data further, without validation or selection, and it is unlikely that the form is actually expecting data with colliding strings, so it seems very unlikely that they would be passed on. At least that is how I code my web apps: just select the data I expect from my form. At present I do not reject data I do not expect, but I'll have to consider either using SafeDict (which I can start using ASAP, not waiting for a new release of Python to be installed on my Web Server, currently running Python 2.4), or rejecting data I do not expect prior to putting it in a dict. That might require tweaking urllib.parse a bit, or cgi, or both.

> A short and very incomplete
> list of vulnerable standard lib modules includes: every single parsing
> library (json, xml, html, plus all the third party libraries that do
> that), all of numpy (because it processes data which probably came
> from a user [yes, integers can trigger the vulnerability]), difflib,
> the math module, most database adaptors, anything that parses metadata
> (including commonly used third party libs like PIL), the tarfile lib
> along with other compressed format handlers, the csv module,
> robotparser, plistlib, argparse, pretty much everything under the
> heading of "18. Internet Data Handling" (email, mailbox, mimetypes,
> etc.), "19. Structured Markup Processing Tools", "20. Internet
> Protocols and Support", "21. Multimedia Services", "22.
> Internationalization", TKinter, and all the os calls that handle
> filenames. The list is impossibly large, even if we completely ignore
> user code. This MUST be fixed at a language level.
>
> I challenge you to find me 15 standard lib components that are certain
> to never handle user-controlled input.

I do appreciate your enumeration, but I'll decline the challenge. While all of them can be interesting exploits of naïve applications (written by programmers who may be quite experienced in some things, but can naïvely overlook other things), most of them probably do not apply to the documented vulnerability. Many I had thought of, but rejected for this context; some I had not. So while there are many possible situations where happily stuffing things into a dict may be an easy solution, there are many possible cases where it should be prechecked on the way in.

And there is another restriction: if the user-controlled input enters a user-run program, it is unlikely to be attacked in the same manner as web servers are attacked. A user, for example, is unlikely to contrive colliding file names for the purpose of making his file listing program run slow. So it is really system services and web services that need to be particularly careful. Randomizing the hash seed might reduce the problem from any system/web services to only long-running system/web services, but doesn't really solve the complete problem, as far as I can tell... only proper care in writing the application (and the stdlib code) will solve the complete problem. Sadly, beefing up the stdlib code will probably reduce performance for uses that will never be exploited, in order to be careful enough in the cases that could be exploited.

>> Note that changing the hash algorithm for a persistent process, even though
>> each process may have a different seed or randomized source, allows attacks
>> for the life of that process, if an attack vector can be created during its
>> lifetime. This is not a problem for systems where each request is handled by
>> a different process, but is a problem for systems where processes are
>> long-running and handle many requests.
> This point has been made many times now. I urge you to read the
[issue13703] Hash collision security issue
Glenn Linderman added the comment:

I don't find a way to delete my prior comment, so I'll add one more (only). The prior comment was intended to go to one person, but I didn't notice that the From, while showing one person's name, actually went back to the ticket (the email address not being for that individual); now I do, so I've learned that. My prior comment was a request for further explanation of things I still don't understand, not intended to be an attack. If someone can delete both this and my prior comment from the issue, or tell me how, feel free.
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Glenn Linderman added the comment:

In msg142098 Ezio said:
> Keep in mind that we should be able to access and use lone surrogates too,
> therefore:
> s = '\ud800'  # should be valid
> len(s)  # should this raise an error? (or return 0.5 ;)?

I say: For streams and data types in which lone surrogates are permitted, a lone surrogate should be treated as and counted as a character (codepoint). For streams and data types in which lone surrogates are not permitted, the assignment should be invalid and raise an error; len would then never see it, and has no quandary.

--
nosy: +v+python
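For context, current CPython 3 behavior for a lone surrogate lines up with the distinction drawn above (illustrative snippet, not from the original message):

    # A lone surrogate is a valid str of length 1 (it counts as one code point)...
    s = '\ud800'
    print(len(s))               # 1

    # ...but a destination that does not permit lone surrogates rejects it,
    # e.g. encoding to UTF-8 for output:
    try:
        s.encode('utf-8')
    except UnicodeEncodeError as e:
        print('rejected:', e.reason)    # surrogates not allowed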
[issue11269] cgi.FieldStorage forgets to unquote field names when parsing multipart/form-data
Glenn Linderman added the comment:

Just some comments for the historical record: During the discussion of issue 4953, research and testing revealed that browsers send back their cgi data using the same charset as the page that they are responding to. So the only way that quoting would be necessary on field names would be if they were quoted funny, as in your example here. It is somewhat unlikely that people would go to the trouble of coding field names that contain " and ' and % characters, just to mess themselves up (which ones do that depends on which quote character is used for the name in the HTML and whether the enctype is "multipart/form-data" or URL encoding).

And Firefox 3.6... provides name=""%22" and that presently works with Python 3.2 CGI! But that might mean that for Firefox 4.x, providing the "\"%22", CGI might pass through the "\"? And really, the dequoting must be incorrectly coded for the Firefox 3.6 case to "work".

--
nosy: +v+python
[issue11269] cgi.FieldStorage forgets to unquote field names when parsing multipart/form-data
Glenn Linderman added the comment:

Sergey says: I wanted to add that the fact that browsers encode the field names in the page encoding does not change that they should escape the header according to RFC 2047.

I respond: True, but RFC 2047 is _so_ weird that it seems browsers have a better solution. RFC 2047 is needed for 7-bit mail servers, from which it seems to have been inherited by specs for HTTP (but I've never seen it used by a browser, have you?). It would be nicer if HTTP had a header that allowed definition of the charset used for subsequent headers. Right now, the code processing form data has to assume a particular encoding for headers & data, and somehow make sure that all the forms that use the same code have the same encoding.

Sergey says: I imagine there could be a non-ASCII field name that, when encoded in some encoding, will produce something SQL-injection-like: '"; other="xx"'. That string would make the header parse into something completely different. With IE8 and FF 3.6 it looks like it would be very simple. The same applies to uploaded files names too, so it's not just a matter of choosing sane field names. That's all a browsers' problem though.

I respond: Perhaps there is, although it depends on how the parser is written what injection techniques would work, and it also depends on having a follow-on parameter with dangerous semantics to incorrectly act on. It isn't just a problem for the browsers, but for every CGI script that parses such parameters.
[issue1271] Raw string parsing fails with backslash as last character
Glenn Linderman added the comment:

I can certainly agree with the opinion that raw strings are working as documented, but I can also agree with the opinion that they contain traps for the unwary, and after getting trapped several times, I have chosen to put up with the double-backslash requirement of regular strings, and avoid the use of raw strings in my code. The double-backslash requirement of regular strings gets ugly for Windows pathnames and some regular expressions, but the traps of raw strings are more annoying than that.

I'm quite sure it would be impossible to "fix" raw strings without causing deprecation churn for people to whom they are useful (if there are any such; hard for me to imagine, but I'm sure there are). I'm quite sure the only reasonable "fix" would be to invent a new type of "escape-free" or "exact" string (to not overuse the term raw, and make two types of raw string). With Python 3, and UTF-8 source files, there is little need for \-prefixed characters (and there is already a string syntax that permits them, when they are needed), so it seems like inventing a new string syntax

    e'string'
    e"""string"""

which would not treat \ in any special manner whatsoever would be useful for all the cases raw strings are presently useful for, and even more useful, because it would handle all the cases that are presently traps for the unwary that raw strings have.

The problem mentioned in this thread of escaping the outer quote character is much more appropriately handled by the triple-quote form. I don't know the Python history well enough to know if raw strings predated triple-quote; if they didn't, there would have been no need for raw strings to attempt to support such.

--
nosy: +v+python
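For context, the trailing-backslash trap and the usual workarounds look like this (standard Python behavior, not part of the proposal above):

    # A raw string still scans backslash escapes, so it cannot end in a backslash:
    #   path = r"C:\temp\"        # SyntaxError
    # Workarounds today: double the backslash, or concatenate the final separator.
    path = "C:\\temp\\"
    same = r"C:\temp" + "\\"
    assert path == same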
[issue1271] Raw string parsing fails with backslash as last character
Glenn Linderman added the comment:

@Graham: seems like the two primary gotchas are trailing \ and \" \' not removing the \. The one that tripped me up the most was the trailing \, but I did get hit with \" once. Probably if Python had been my first programming language that used \ escapes, it wouldn't be such a problem, but since it never will be, and because I'm forced to use the others from time to time, still, learning yet a different form of "not-quite raw" string just isn't worth my time and debug efforts. When I first saw it, it sounded more useful than doubling the \s like I do elsewhere, but after repeated trip-ups with trailing \, I decided it wasn't for me.

@R David: Interesting description of the parsing/escaping. Sounds like that makes for a cheap parser, fewer cases to handle. But there is little that is hard about escape-free or exact string parsing: just look for the trailing " ' """ or ''' that matches the one at the beginning. The only thing difficult is if you want to escape the quote, but with the rich set of quotes available, it is extremely unlikely that you can't find one that you can use, except perhaps if you are writing a parser for parsing Python strings, in which case the regular expression that matches any leading quote could be expressed as: '("|"""|' "'|''')" Granted that isn't the clearest syntax in the world, but it is also uncommon, and can be assigned to a nicely named variable such as matchLeadingQuotationRE in one place, and used wherever needed.

Regarding the use of / rather than \: that is true if you are passing file names to Windows APIs, but not true if you are passing them to other programs that use / as option syntax and \ as path separator (most Windows command line utilities).
[issue1271] Raw string parsing fails with backslash as last character
Glenn Linderman added the comment:

On 3/12/2011 7:11 PM, R. David Murray wrote:
> R. David Murray added the comment:
>
> I've opened issue 11479 with a proposed patch to the tutorial along the lines
> suggested by Graham.

Which is good, for people that use the tutorial. I jump straight to the reference guide, usually, because of so many years of experience with other languages. But I was surprised you used .strip() instead of [:-1] which is shorter and I would expect it to be more efficient also.

--
Added file: http://bugs.python.org/file21098/unnamed
[issue1602] windows console doesn't print or input Unicode
Glenn Linderman added the comment:

Presently, a correct application only needs to flush between a sequence of writes and a sequence of buffer.writes. Don't assume the flush happens after every write, for a correct application.
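A small sketch of the rule being stated, i.e. a correct application flushes once when switching between the text layer and the underlying binary buffer, not after every write (illustrative only):

    import sys

    # a sequence of text-layer writes...
    sys.stdout.write("text line\n")
    sys.stdout.write("more text\n")
    sys.stdout.flush()                   # flush once before switching layers

    # ...then a sequence of binary writes on the underlying buffer
    sys.stdout.buffer.write(b"raw bytes\n")
    sys.stdout.buffer.flush()            # and once before switching back
    sys.stdout.write("text again\n")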
[issue1602] windows console doesn't print or input Unicode
Glenn Linderman added the comment:

Would it suffice if the new scheme internally flushed after every buffer.write? It wouldn't be needed after write, because the correct application would already do one there? Am I off-base in supposing that the performance of buffer.write is expected to include a flush (because it isn't expected to be buffered)?
[issue1602] windows console doesn't print or input Unicode
Glenn Linderman added the comment:

David-Sarah said: In any case, given that the buffer of the initial std{out,err} will always be a BufferedWriter object (since .buffer is readonly), it would be possible for the TextIOWrapper to test a dirty flag in the BufferedWriter, in order to check efficiently whether the buffer needs flushing on each write. I've looked at the implementation complexity cost of this, and it doesn't seem too bad.

So if flush checks that bit, maybe TextIOWrapper could just call buffer.flush, and it would be fast if clean and slow if dirty? Calling it at the beginning of a text-level write, that is, which would let the char-at-a-time calls to buffer.write be fast.

And I totally agree with msg132191.
[issue1602] windows console doesn't print or input Unicode
Glenn Linderman added the comment:

David-Sarah wrote: Windows is very slow at scrolling a console, which might make the cost of flushing insignificant in comparison.

Just for the record, I noticed a huge speedup in Windows console scrolling when I switched from WinXP to Win7 on a faster computer :) How much is due to the XP->7 switch and how much to the faster computer, I cannot say, but it seemed much more significant than other speedups in other software. The point? Benchmark it on Win7, not XP.
[issue11945] Adopt and document consistent semantics for handling NaN values in containers
Glenn Linderman added the comment:

Bertrand Meyer's exposition is flowery, and he is a learned man, but the basic argument he makes is: Reflexivity of equality is something that we expect for any data type, and it seems hard to justify that a value is not equal to itself. As to assignment, what good can it be if it does not make the target equal to the source value?

The argument is flawed: now that NaN exists, and is not equal to itself in value, there should be, and need be, no expectation that assignment elsewhere should make the target equal to the source in value. It can, and in Python should, make them match in identity (is) but not in value (==, equality).

I laud the idea of adding the definition of reflexive equality to the glossary. However, I think it is presently a bug that a list containing a NaN value compares equal to itself. Yes, such a list should have the same identity (is), but should not be equal.

--
nosy: +v+python
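The behavior under discussion can be shown directly (current CPython behavior, included for context):

    nan = float('nan')
    print(nan == nan)        # False: equality of the value is non-reflexive
    print(nan is nan)        # True: identity still holds

    # Containers check identity before falling back to ==, so a list
    # containing NaN nevertheless compares equal to itself:
    print([nan] == [nan])    # True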
[issue11945] Adopt and document consistent semantics for handling NaN values in containers
Glenn Linderman added the comment:

Nick says (and later explains better what he meant): The status quo works. Proposals to change it on theoretical grounds have a significantly higher bar to meet than proposals to simply document it clearly.

I say: What the status quo doesn't provide is containers that "work". In this case what I mean by "work" is that equality of containers is based on value, and value comparisons, and accept and embrace non-reflexive equality. It might be possible to implement alternate containers with these characteristics, but that requires significantly more effort than simply filtering values.

Nonetheless, I totally agree with msg134654, and agree that properly documenting the present implementation would be a great service to users of the present implementation.
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment:

Regarding http://bugs.python.org/issue4953#msg91444: POST with multipart/form-data encoding can use UTF-8, other stuff is restricted to ASCII!

From http://www.w3.org/TR/html401/interact/forms.html:

    Note. The "get" method restricts form data set values to ASCII characters.
    Only the "post" method (with enctype="multipart/form-data") is specified to
    cover the entire [ISO10646] character set.

Hence cgi formdata can safely decode text fields using UTF-8 decoding (experimentally, that is the encoding used by Firefox to support the entire ISO10646 character set).

--
nosy: +v+python
[issue10479] cgitb.py should assume a binary stream for output
New submission from Glenn Linderman:

The CGI interface is a binary stream, because it is pumped directly to/from the HTTP protocol, which is a binary stream. Hence, cgitb.py should produce binary output. Presently, it produces text output. When one sets stdout to a binary stream, and then cgitb intercepts an error, cgitb fails.

Demonstration of problem:

    import sys
    import traceback
    sys.stdout = open("sob", "wb")  # WSGI sez data should be binary, so stdout should be binary???
    import cgitb
    sys.stdout.write(b"out")
    fhb = open("fhb", "wb")
    cgitb.enable()
    fhb.write("abcdef")  # try writing non-binary to binary file. Expect an error, of course.

--
components: Unicode
messages: 121865
nosy: v+python
priority: normal
severity: normal
status: open
title: cgitb.py should assume a binary stream for output
versions: Python 3.2
[issue10480] cgi.py should document the need for binary stdin/stdout
New submission from Glenn Linderman:

CGI is a bytestream protocol. Python assumes a text mode encoding for stdin and stdout; this is inappropriate for the CGI interface. CGI should provide an API to "do the right thing" to make stdin and stdout binary mode interfaces (including the msvcrt setting to binary on Windows). Failing that, it should document the need to do so in CGI applications. Failing that, it should be documented somewhere; CGI seems the most appropriate place to me.

--
components: Library (Lib)
messages: 121868
nosy: v+python
priority: normal
severity: normal
status: open
title: cgi.py should document the need for binary stdin/stdout
versions: Python 3.2
[issue10481] subprocess PIPEs are byte streams
New submission from Glenn Linderman:

While http://bugs.python.org/issue2683 did clarify the fact that the .communicate API takes a byte stream as input, it is easy to miss the implication. Because Python programs start up with stdin as a text stream, it might be good to point out that some action may need to be taken to be sure that the receiving program expects a byte stream, or that the byte stream supplied should be in an encoding that the receiving program is expecting and can decode appropriately.

No mention is presently made in the documentation for .communicate that its output is also a byte stream, and, if it is text, that it will correspond to whatever encoding is used by the sending program.

--
assignee: d...@python
components: Documentation
messages: 121869
nosy: d...@python, v+python
priority: normal
severity: normal
status: open
title: subprocess PIPEs are byte streams
versions: Python 3.2
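A hedged sketch of the point being documented: with PIPE handles, communicate() accepts and returns bytes, so encoding and decoding are the caller's responsibility (the choice of UTF-8 here is only an example):

    import subprocess
    import sys

    # Child Python process that echoes its stdin back to stdout.
    p = subprocess.Popen(
        [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdin.read())'],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    out, _ = p.communicate('text to send\n'.encode('utf-8'))  # must pass bytes
    print(out.decode('utf-8'))                                # must decode bytes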
[issue10481] subprocess PIPEs are byte streams
Glenn Linderman added the comment:

Maybe it should also be mentioned that p.stdout and p.stderr and p.stdin, when set to be PIPEs, are also byte streams. Of course that is the reason that communicate accepts and produces byte streams.
[issue10482] subprocess and deadlock avoidance
New submission from Glenn Linderman:

.communicate is a nice API for programs that produce little output, and can be buffered. While that may cover a wide range of uses, it doesn't cover launching CGI programs, such as is done in http.server. Now there are nice warnings about that issue in the http.server documentation. However, while .communicate has the building blocks to provide more general solutions, it doesn't expose them to the user, nor does it separate them into building blocks; rather it is a monolith inside ._communicate.

For example, it would be nice to have an API that would "capture a stream using a thread", which could be used for either stdout or stderr, and is what ._communicate does under the covers for both of them. It would also be nice to have an API that would "pump a bytestream to .stdin" as a background thread. ._communicate doesn't provide that one, but uses the foreground thread for that, and it requires that it be fully buffered. It would be most useful for http.server if this API could connect a file handle and an optional maximum read count to .stdin, yet do it in a background thread. That would leave the foreground thread able to process stdout.

It is correct (but not what http.server presently does, but I'll be entering that enhancement request soon) for http.server to read the first line from the CGI program, transform it, add a few more headers, and send that to the browser, and then hook up .stdout to the browser (shutil.copyfileobj can be used for the rest of the stream). However, there is no deadlock-free way of achieving this sort of solution, capturing the stderr to be logged, not needing to buffer a potentially large file upload, and transforming the stdout, with the facilities currently provided by subprocess.

Breaking apart some of the existing building blocks, and adding an additional one for .stdin processing, would allow a real http.server implementation, as well as being more general for other complex uses. You see, for http.server, the stdin

--
components: Library (Lib)
messages: 121871
nosy: v+python
priority: normal
severity: normal
status: open
title: subprocess and deadlock avoidance
type: feature request
versions: Python 3.2
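As a rough sketch of the "capture a stream using a thread" building block described above (the helper and usage names are hypothetical; this is not the subprocess internal API):

    import threading

    def reader_thread(fh, chunks):
        # Drain a pipe in the background, appending what was read to `chunks`,
        # so the foreground thread can service a different stream without deadlock.
        chunks.append(fh.read())

    # Hypothetical usage with a Popen object `p` whose stderr is a PIPE:
    #   stderr_chunks = []
    #   t = threading.Thread(target=reader_thread, args=(p.stderr, stderr_chunks))
    #   t.daemon = True
    #   t.start()
    #   ... foreground code streams p.stdout to the client ...
    #   t.join()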
[issue10483] http.server - what is executable on Windows
New submission from Glenn Linderman:

The def executable for CGIHTTPRequestHandler is simply wrong on Windows. The Unix executable bits do not apply. Yet it is not clear what to use instead.

One could check the extension against PATHEXT, perhaps, but Windows doesn't limit itself to that except when not finding the exactly specified executable name. Or one could require and borrow the Unix #! convention. As an experiment, since I'm mostly dealing with script files, I tried out a hack that implements two #! lines, the first for Unix and the second for Windows, and only considers something executable if the second line exists. This fails miserably for .exe files, of course. Another possibility would be to see if there is an association for the extension, but that rule would permit a Word document to be "executable" because there is a way to open it using MS Word. Another possibility would be to declare a list of extensions in the server source, like the list of directories from which CGIs are found. Another possibility would be to simply assume that anything found in the CGI directory is executable. Another possibility is to require the .cgi extension only to be executable, but then it is hard to know how to run it. Another possibility is to require two "extensions"... the "real" one for Windows, and then .cgi just before it. So to make a program executable, it would be renamed from file.ext to file.cgi.ext

But the current technique is clearly insufficient.

--
components: Library (Lib)
messages: 121875
nosy: v+python
priority: normal
severity: normal
status: open
title: http.server - what is executable on Windows
type: behavior
versions: Python 3.2
[issue10484] http.server.is_cgi fails to handle CGI URLs containing PATH_INFO
New submission from Glenn Linderman:

is_cgi doesn't properly handle PATH_INFO parts of the path. The Python 2.x CGIHTTPServer.py had this right, but the introduction and use of _url_collapse_path_split broke it. _url_collapse_path_split splits the URL into two parts; the second part is guaranteed to be a single path component, and the first part is the rest. However, URLs such as /cgi-bin/foo.exe/this/is/PATH_INFO/parameters can and do want to exist, but the code in is_cgi will never properly detect that /cgi-bin/foo.exe is the appropriate executable, and the rest should be PATH_INFO. This used to work correctly in the predecessor CGIHTTPServer.py code in Python 2.6, so is a regression.

--
components: Library (Lib)
messages: 121876
nosy: v+python
priority: normal
severity: normal
status: open
title: http.server.is_cgi fails to handle CGI URLs containing PATH_INFO
type: behavior
versions: Python 3.2
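For illustration, the kind of split being asked for, shown as a hand-rolled sketch rather than the actual stdlib code:

    # A CGI URL with PATH_INFO should split into the script and the trailing path info:
    path = '/cgi-bin/foo.exe/this/is/PATH_INFO/parameters'
    cgi_directories = ['/cgi-bin', '/htbin']

    for cgi_dir in cgi_directories:
        if path.startswith(cgi_dir + '/'):
            script, _, path_info = path[len(cgi_dir) + 1:].partition('/')
            print('script    =', cgi_dir + '/' + script)   # /cgi-bin/foo.exe
            print('PATH_INFO =', '/' + path_info)          # /this/is/PATH_INFO/parameters
            break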
[issue10485] http.server fails when query string contains addition '?' characters
New submission from Glenn Linderman:

http.server on Python 3 and CGIHTTPServer on Python 2 both contain the same code with the same bug. In run_cgi, rest.rfind('?') is used to separate the path from the query string. However, it should be rest.find('?'), as the query string starts at the first '?' but may also contain '?'. It is required that '?' not be used in the URL path part without escaping. Apache, for example, separates the following URL:

    /testing?foo=bar?&baz=3

into path part /testing and query string part foo=bar?&baz=3, but http.server does not.

--
components: Library (Lib)
messages: 121877
nosy: v+python
priority: normal
severity: normal
status: open
title: http.server fails when query string contains addition '?' characters
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2
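The difference on the example URL (plain string operations, shown for context):

    rest = '/testing?foo=bar?&baz=3'

    i = rest.find('?')      # 8  -> the first '?' starts the query string
    j = rest.rfind('?')     # 16 -> the last '?' is part of the query string

    print(rest[:i], rest[i + 1:])   # /testing  foo=bar?&baz=3    (correct split)
    print(rest[:j], rest[j + 1:])   # /testing?foo=bar  &baz=3    (wrong split)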
[issue10485] http.server fails when query string contains addition '?' characters
Changes by Glenn Linderman:

--
type: -> behavior
[issue10486] http.server doesn't set all CGI environment variables
New submission from Glenn Linderman:

HTTP_HOST, HTTP_PORT, and REQUEST_URI are variables that my CGI scripts use, but which are not available from http.server or CGIHTTPServer (until I added them). There may be more standard variables that are not set; I didn't attempt to enumerate the whole list.

--
components: Library (Lib)
messages: 121878
nosy: v+python
priority: normal
severity: normal
status: open
title: http.server doesn't set all CGI environment variables
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2
[issue10487] http.server - doesn't process Status: header from CGI scripts
New submission from Glenn Linderman:

While it is documented that http.server (and Python 2's CGIHTTPServer) do not process the Status: header, and limit the usefulness of CGI scripts as a result, that doesn't make it less of a bug, just a documented bug. But I guess that it might have to be called a feature request; I'll not argue if someone switches this to feature request, but I consider it a bug.

See related issue 10482 for subprocess to provide better features for avoiding deadlock situations. There seems to be no general way using subprocess to avoid possible deadlock situations. However, since CGI doesn't really use stderr much, and only for logging, which the scripts can do themselves (the cgi.py module even provides for such), and because CGIs generally slurp stdin before creating stdout, it is possible to sidestep use of subprocess.communicate, drop the stdout PIPE, and sequence the code to process stdin and then stdout, and not generally deadlock (some CGI scripts that don't follow the stdin-before-stdout rule might deadlock if called with POST and large inputs, but those are few). By doing this, one can then add code to handle Status: headers, and avoid buffering large files on output (and on input). The tradeoff is losing the stderr log; when that is hooked up, some error cases can trigger deadlocks by writing to stderr -- hence the subprocess issue mentioned above.

--
components: Library (Lib)
messages: 121881
nosy: v+python
priority: normal
severity: normal
status: open
title: http.server - doesn't process Status: header from CGI scripts
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2
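A hedged sketch of the missing Status: handling (hypothetical helper, not the http.server implementation): read the header block the CGI child produced, honor a Status: line if present, and default to 200 otherwise.

    def parse_cgi_headers(raw):
        # `raw` is the header block the CGI script wrote before the blank line.
        status, headers = '200 OK', []
        for line in raw.decode('iso-8859-1').splitlines():
            name, _, value = line.partition(':')
            if name.strip().lower() == 'status':
                status = value.strip()          # e.g. "302 Found"
            else:
                headers.append((name.strip(), value.strip()))
        return status, headers

    print(parse_cgi_headers(b'Status: 404 Not Found\r\nContent-Type: text/html\r\n'))
    # ('404 Not Found', [('Content-Type', 'text/html')])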
[issue10479] cgitb.py should assume a binary stream for output
Changes by Glenn Linderman:

--
type: -> behavior
[issue10480] cgi.py should document the need for binary stdin/stdout
Changes by Glenn Linderman:

--
assignee: -> d...@python
components: +Documentation
nosy: +d...@python
type: -> behavior
[issue10483] http.server - what is executable on Windows
Glenn Linderman added the comment:

Martin, that is an interesting viewpoint, and one I considered, but didn't state, because it seems much too restrictive. Most CGI programs are written in scripting languages, not compiled to .exe. So it seems the solution should allow for launching at least Perl and Python scripts, as well as .exe. Whether subprocess.Popen can directly execute it, or whether it needs help from the registry or a #! line to get the execution going, is just a matter of tweaking the coding for what gets passed to subprocess.Popen. Declaring the definition based on what the existing code can already do is self-limiting.

Another possible interpretation of executable might be the PATHEXT environment variable, but that is similar to declaring a list in the server source, which I did mention. One might augment the other.
[issue10483] http.server - what is executable on Windows
Glenn Linderman added the comment:

The rest of the code has clearly never had its deficiencies exposed on Windows, simply because executable() has prevented that. So what the rest of the code "already supports" is basically nothing. Reasonable Windows support is appropriate to implement as part of the bugfix.

You state that it isn't the function of http.server to extend Windows; however, even MS IIS has extended Windows to provide reasonable web scripting functionality, albeit in its own way, thus convicting the Windows facilities of being insufficient. Attempting to use http.server to get a web testing environment running so that Python scripts can be tested locally requires some way of using an existing environment (except, of course, for "all new" web sites). I suppose you would claim that using http.server for a web testing environment is an inappropriate use of http.server, also. Yet http.server on Unix appears to provide an adequate web testing environment; yes, some of that is because of Unix's #! feature. This would certainly not be the first case where more code is required on Windows than Unix to implement reasonable functionality.

My desire for support for Perl is not an attempt to convince Python developers to use Perl instead of Python, but simply a reflection of the practicality of life: there are a lot of Perl CGI scripts used for pieces of Web servers. Reinventing them in Python may be fun, but can be more time consuming than projects may have the luxury to do. Your claim that it already supports Python CGI scripts must be tempered by the documentation claim that it provides "altered semantics". "Altered semantics", as best as I can read in the code, is that the query string is passed to the Python script as a command line if it doesn't happen to contain an "=" sign. This is weird, unlikely to be found in a real web server, and hence somewhat useless for use as a test server also.

http.server has chosen to use subprocess, which has chosen to use CreateProcess as its way of executing CGI. There are other Windows facilities for executing programs, such as ShellExecute, but of course it takes the opposite tack: it can "execute" nearly any file, via registry-based associations. Neither of these seems to be directly appropriate for use by http.server, the former being too restrictive without enhancements, the latter being too liberal in executing too many file types, although the requirement that CGI scripts live in specific directories may sufficiently rein in that liberality. However, you have made me think through the process: it seems that an appropriate technique for Windows is to allow for a specific set of file extensions, and permit them to be executed using the registry-based association to do so. However, for .cgi files, which depend heavily on the Unix #!, emulation of #! seems appropriate (and Windows doesn't seem to have an association for .cgi files either).

Your suggestion of making CGIHTTPRequestHandler easier to subclass is certainly a good one, and is almost imperative to implement to fix this bug in a useful manner without implementing an insufficient set of Windows extensions (for someone's definition of wrong). There should be a way to sidestep the "altered semantics" for Python scripts (and Python scripts shouldn't have to be a special case; they should work with the general case) without replacing the whole run_cgi() function. There should be a hook to define the list of executable extensions, and how to run them, and/or a hook to alter the command line passed to subprocess.Popen to achieve the same; a sketch of such a subclass follows. So is_executable and is_python both seem to currently be replaceable. What is missing is a hook to implement cmdline creation before calling subprocess.Popen() (besides the other reported bugs, of course).
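A minimal sketch of the kind of subclass hook being asked for (the cgi_extensions attribute is hypothetical; is_executable is an existing overridable method):

    import http.server

    class MyCGIHandler(http.server.CGIHTTPRequestHandler):
        # Hypothetical policy: decide executability by extension instead of
        # Unix mode bits, which carry no meaning on Windows.
        cgi_extensions = ('.py', '.pl', '.exe', '.cgi')

        def is_executable(self, path):
            return path.lower().endswith(self.cgi_extensions)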
[issue10483] http.server - what is executable on Windows
Glenn Linderman added the comment:

Martin, you are splitting hairs about the "reported problem". The original message does have a paragraph about the executable bits being wrong. But the bulk of the message is commenting about the difficulty of figuring out what to replace it with. So it looks like, in spite of the hair splitting, we have iterated to a design of making run_cgi a bit friendlier in this regard.

I find it sufficient to define a method fully extracted from run_cgi as follows:

    def make_cmdline(self, scriptfile, query):
        cmdline = [scriptfile]
        if self.is_python(scriptfile):
            interp = sys.executable
            if interp.lower().endswith("w.exe"):
                # On Windows, use python.exe, not pythonw.exe
                interp = interp[:-5] + interp[-4:]
            cmdline = [interp, '-u'] + cmdline
        if '=' not in query:
            cmdline.append(query)
        return cmdline

This leaves run_cgi with:

    import subprocess
    cmdline = self.make_cmdline(scriptfile, query)
    self.log_message("command: %s", subprocess.list2cmdline(cmdline))

Apologies: I don't know what format of patch is acceptable, but this is a simple cut-n-paste change. I was sort of holding off until the hg conversion to figure out how to make code submissions, since otherwise I'd have to learn it twice in short order. I have reimplemented my work-arounds in terms of the above fix, and they function correctly, so this fix would suffice for me, for this issue. (N.B. I'm sure you've noticed that I have entered a number of issues for http.server; I hope that was the right way to do it, to attempt to separate the issues.)
[issue10486] http.server doesn't set all CGI environment variables
Glenn Linderman added the comment:

Took a little more time to do a little more analysis on this one. Compared a sample query via Apache on Linux vs http.server, then looked up the CGI RFC for more info:

    DOCUMENT_ROOT: ...
    GATEWAY_INTERFACE: CGI/1.1
    HTTP_ACCEPT: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    HTTP_ACCEPT_ENCODING: gzip,deflate
    HTTP_ACCEPT_LANGUAGE: en-us,en;q=0.5
    HTTP_CONNECTION: keep-alive
    HTTP_COOKIE: ...
    HTTP_HOST: ...
    HTTP_KEEP_ALIVE: 115
    HTTP_USER_AGENT: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10
    PATH: /usr/local/bin:/usr/bin:/bin
    PATH_INFO: ...
    PATH_TRANSLATED: ...
    QUERY_STRING:
    REMOTE_ADDR: 173.75.100.22
    REMOTE_PORT: 50478
    REQUEST_METHOD: GET
    REQUEST_URI: ...
    SCRIPT_FILENAME: ...
    SCRIPT_NAME: ...
    SERVER_ADDR: ...
    SERVER_ADMIN: ...
    SERVER_NAME: ...
    SERVER_PORT: ...
    SERVER_PROTOCOL: HTTP/1.1
    SERVER_SIGNATURE: Apache Server at rkivs.com Port 80
    SERVER_SOFTWARE: Apache
    UNIQUE_ID: TLEs8krc24oAABQ1TIUAAAPN

Above from Apache, below from http.server:

    GATEWAY_INTERFACE: CGI/1.1
    HTTP_USER_AGENT: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
    PATH_INFO: ...
    PATH_TRANSLATED: ...
    QUERY_STRING: ...
    REMOTE_ADDR: 127.0.0.1
    REQUEST_METHOD: GET
    SCRIPT_NAME: ...
    SERVER_NAME: ...
    SERVER_PORT: ...
    SERVER_PROTOCOL: HTTP/1.0
    SERVER_SOFTWARE: SimpleHTTP/0.6 Python/3.2a4

Analysis of missing variables between Apache and http.server:

    DOCUMENT_ROOT
    HTTP_ACCEPT
    HTTP_ACCEPT_CHARSET
    HTTP_ACCEPT_ENCODING
    HTTP_ACCEPT_LANGUAGE
    HTTP_CONNECTION
    HTTP_COOKIE
    HTTP_HOST
    HTTP_KEEP_ALIVE
    HTTP_PORT
    PATH
    REQUEST_URI
    SCRIPT_FILENAME
    SERVER_ADDR
    SERVER_ADMIN

Additional variables mentioned in RFC 3875, not used for my test requests:

    AUTH_TYPE
    CONTENT_LENGTH
    CONTENT_TYPE
    REMOTE_IDENT
    REMOTE_USER
[issue10484] http.server.is_cgi fails to handle CGI URLs containing PATH_INFO
Glenn Linderman added the comment:

Here is a replacement for the body of is_cgi that will work with the current _url_collapse_path_split function, but it seems to me that it is inefficient to do multiple splits and joins of the path between the two functions.

    splitpath = server._url_collapse_path_split(self.path)
    # more processing required due to possible PATHINFO parts
    # not clear above function really does what is needed here,
    # nor just how general it is!
    splitpath = '/'.join(splitpath).split('/', 2)
    head = '/' + splitpath[1]
    tail = splitpath[2]
    if head in self.cgi_directories:
        self.cgi_info = head, tail
        return True
    return False
[issue10482] subprocess and deadlock avoidance
Glenn Linderman added the comment:

So I've experimented a bit, and it looks like simply exposing ._readerthread as an external API would handle the buffered case for stdout or stderr. For http.server CGI scripts, I think it is fine to buffer stderr, as it should not be a high-volume channel... but not both stderr and stdout, as stdout can be huge. And not stdin, because it can be huge also.

For stdin, something like the following might work nicely for some cases, including http.server (with revisions):

    def _writerthread(self, fhr, fhw, length):
        while length > 0:
            buf = fhr.read(min(8196, length))
            fhw.write(buf)
            length -= len(buf)
        fhw.close()

When the stdin data is buffered, but the application wishes to be stdout centric instead of stdin centric (like the current ._communicate code), a variation could be made replacing fhr by a data buffer, and writing it gradually (or fully) to the pipe, but from a secondary thread.

Happily, this sort of code (the above is extracted from a test version of http.server) can be implemented in the server, but would be more usefully provided by subprocess, in my opinion. To include the above code inside subprocess would just be a matter of tweaking references to class members instead of parameters.
[issue10482] subprocess and deadlock avoidance
Glenn Linderman added the comment:

Here's an updated _writerthread idea that handles more cases:

    def _writerthread(self, fhr, fhw, length=None):
        if length is None:
            flag = True
            while flag:
                buf = fhr.read(512)
                fhw.write(buf)
                if len(buf) == 0:
                    flag = False
        else:
            while length > 0:
                buf = fhr.read(min(512, length))
                fhw.write(buf)
                length -= len(buf)
            # throw away additional data [see bug #427345]
            while select.select([fhr._sock], [], [], 0)[0]:
                if not fhr._sock.recv(1):
                    break
        fhw.close()
[issue10482] subprocess and deadlock avoidance
Glenn Linderman added the comment:

Sorry, left some extraneous code in the last message, here is the right code:

    def _writerthread(self, fhr, fhw, length=None):
        if length is None:
            flag = True
            while flag:
                buf = fhr.read(512)
                fhw.write(buf)
                if len(buf) == 0:
                    flag = False
        else:
            while length > 0:
                buf = fhr.read(min(512, length))
                fhw.write(buf)
                length -= len(buf)
        fhw.close()
[issue10487] http.server - doesn't process Status: header from CGI scripts
Glenn Linderman added the comment:

Just to mention, with the added code from issue 10482, I was able to get a 3-stream functionality working great in http.server and also backported it to 2.6 CGIHTTPServer... and to properly process the Status: header on stdout. Works very well in 2.6; issue 8077 prevents form processing from working in 3.2a4, but otherwise it is working there also, and the experience in 2.6 indicates that once issue 8077 is resolved, it should work in 3.2 also.
[issue10482] subprocess and deadlock avoidance
Glenn Linderman added the comment:

Looking at the code the way I've used it in my modified server.py:

    stderr = []
    stderr_thread = threading.Thread(target=self._readerthread,
                                     args=(p.stderr, stderr))
    stderr_thread.daemon = True
    stderr_thread.start()
    self.log_message("writer: %s" % str(nbytes))
    stdin_thread = threading.Thread(target=self._writerthread,
                                    args=(self.rfile, p.stdin, nbytes))
    stdin_thread.daemon = True
    stdin_thread.start()

and later

    stderr_thread.join()
    stdin_thread.join()
    p.stderr.close()
    p.stdout.close()
    if stderr:
        stderr = stderr[0].decode("UTF-8")

It seems like this sort of code (possibly passing in the encoding) could be bundled back inside subprocess (I borrowed it from there). It also seems from recent discussion on python-dev that the cheat-sheet "how to replace" other sys and os popen functions would be better done as wrapper functions for the various cases. Someone pointed out that the hard cases probably aren't cross-platform, but that currently the easy cases all get harder when using subprocess than when using the deprecated facilities. They shouldn't. The names may need to be a bit more verbose to separate the various use cases, but each use case should remain at least as simple as the prior function.

So perhaps instead of just subprocess.PIPE to select particular handling for stdin, stdout, and stderr, subprocess should implement some varieties to handle attaching different types of reader and writer threads to the handles... of course, parameters need to come along for the ride too: maybe the additional variations would be object references with parameters supplied, instead of just a manifest constant like .PIPE.
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment:

Pierre, thanks for your work on this. I hope a fix can make it into 3.2. However, while starting Python with -u can help a bit, that should not, in my opinion, be a requirement to use CGI. Rather, the stdin should be set into binary mode by the CGI processing... it would be helpful if the CGI module either did it automatically, verified it has been done, or at least provided a helper function that could do it, and that appropriate documentation be provided, if it is not automatic.

I've seen code like:

    try:
        # Windows needs stdio set for binary mode.
        import msvcrt
        msvcrt.setmode(0, os.O_BINARY)  # stdin  = 0
        msvcrt.setmode(1, os.O_BINARY)  # stdout = 1
        msvcrt.setmode(2, os.O_BINARY)  # stderr = 2
    except ImportError:
        pass

and

    if hasattr(sys.stdin, 'buffer'):
        sys.stdin = sys.stdin.buffer

which together seem to do the job.

For output, I use a little class that accepts either binary or text, encoding the latter:

    class IOMix():
        def __init__(self, fh, encoding="UTF-8"):
            if hasattr(fh, 'buffer'):
                self._bio = fh.buffer
                fh.flush()
                self._last = 'b'
                import io
                self._txt = io.TextIOWrapper(self._bio, encoding, None, '\r\n')
                self._encoding = encoding
            else:
                raise ValueError("not a buffered stream")

        def write(self, param):
            if isinstance(param, str):
                self._last = 't'
                self._txt.write(param)
            else:
                if self._last == 't':
                    self._txt.flush()
                self._last = 'b'
                self._bio.write(param)

        def flush(self):
            self._txt.flush()

        def close(self):
            self.flush()
            self._txt.close()
            self._bio.close()

    sys.stdout = IOMix(sys.stdout, encoding)
    sys.stderr = IOMix(sys.stderr, encoding)

IOMix may need a few more methods for general use; "print" comes to mind, for example.
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Regarding the use of detach(), I don't know if it works. Maybe it would. I know my code works, because I have it working. But if there are simpler solutions that are shown to work, that would be great. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Peter, it seems that detach is relatively new (3.1); likely the code samples and suggestions that I had found to cure the problem predate that. While I haven't yet tried detach, your code doesn't seem to modify stdin, so are you suggesting, really...

    sys.stdin = sys.stdin.detach()

or maybe

    if hasattr(sys.stdin, 'detach'):
        sys.stdin = sys.stdin.detach()

On the other hand, if detach, coded as above, is equivalent to

    if hasattr(sys.stdin, 'buffer'):
        sys.stdin = sys.stdin.buffer

then I wonder why it was added. So maybe I'm missing something in reading the documentation you pointed at, and also that at http://docs.python.org/py3k/library/io.html#io.TextIOBase.detach both of which seem to be well-documented if you already have a clear understanding of the layers in the IO subsystem, but perhaps not so well-documented if you don't yet (and I don't). But then you referred to the platform-dependent stuff... I don't see anything in the documentation for detach() that implies that it also makes the adjustments needed on Windows to the C runtime, which is what the platform-dependent stuff I suggested does... if it does, great, but a bit more documentation would help in understanding that. And if it does, maybe that is the difference between the two code fragments in this comment? I would have to experiment to find out, and am not in a position to do that this moment. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Rereading the doc link I pointed at, I guess detach() is part of the new API since 3.1, so doesn't need to be checked for in 3.1+ code... but instead, may need to be coded as:

    try:
        sys.stdin = sys.stdin.detach()
    except UnsupportedOperation:
        pass

-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
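On the buffer-versus-detach() question raised a couple of messages up: both expressions yield the same underlying binary stream; the practical difference (ordinary io behaviour, nothing specific to cgi or this patch) is what happens to the text wrapper afterwards:

    import sys

    binary_in = sys.stdin.buffer     # text wrapper stays usable alongside the buffer
    # versus
    # binary_in = sys.stdin.detach() # returns the same buffer, but the old text
    #                                # wrapper is detached and raises ValueError if used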
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: So then David, is your suggestion to use sys.stdin = sys.stdin.detach(), and do you claim that the Windows-specific hacks are not needed in 3.x land? They are needed in 2.x land, I have proven empirically, but I haven't been able to test CGI forms very well in 3.x because of this bug. I will test 3.x download without the Windows-specific hack, and report how it goes. My testing started with 2.x and has proceeded to 3.x, and it is not always obvious which hacks are no longer needed in 3.x. Thanks for the info. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: David, starting from a working (but hacked to work) version of http.server and using 3.2a1 (I should upgrade to the beta, but I doubt it makes a difference at the moment), I modified

    # if hasattr( sys.stdin, 'buffer'):
    #     sys.stdin = sys.stdin.buffer
    sys.stdin = sys.stdin.detach()

and it all kept working. Then I took out the

    try:
        # Windows needs stdio set for binary mode.
        import msvcrt
        msvcrt.setmode(0, os.O_BINARY)  # stdin  = 0
        msvcrt.setmode(1, os.O_BINARY)  # stdout = 1
        msvcrt.setmode(2, os.O_BINARY)  # stderr = 2
    except ImportError:
        pass

and it quit working. It seems that \r\r\n\r\r\n is not recognized by Firefox as the "end of the headers" delimiter. Whether this is a bug in IO or not, I can't say for sure. It does seem, though, that 1) If Python is fully replacing the IO layers, which in 3.x it seems to claim to, then it should fully replace them, building on a binary byte stream, not a "binary byte stream with replacement of \n by \r\n". The Windows hack above replaces, for stdin, stdout, and stderr, a "binary byte stream with replacement of \n by \r\n" with a binary byte stream. It seems like Python should do that, on Windows, so that it has a chance of actually knowing/controlling what gets generated. Perhaps it does, if started with "-u", but starting with "-u" should not be a requirement for a properly functioning program. Alternately, the IO streams could understand, and toggle, the os.O_BINARY flag, but that seems like it would require more platform-specific code than simply opening all Windows files (and adjusting preopened Windows files) during initialization. 2) The weird CGI processing that exists in the released version of http.server seems to cover up this problem, partly because it isn't very functional, claims "alternate semantics" (read: non-standard semantics), and invokes Python with -u when it does do so. It is so non-standard that it isn't clear what should or should not be happening. But the CGI scripts I am running, that pass or fail as above, also run on Windows 2.6, and particularly, Unix 2.6, in an Apache environment. So I have been trying to minimize the differences to startup code, rather than add platform-specific tweaks throughout the CGI scripts. That said, it clearly could be my environment, but I've debugged enough different versions of things to think that the Windows hack above is required on both 2.x and 3.x to ensure proper bytestreams, and others must think so too, because I found the code by searching on Google, not because I learned enough Python internals to figure it out on my own. The question I'm attempting to address here is only that 3.x still needs the same hack that 2.x needs, on Windows, to create bytestreams. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: (and I should mention that all the "hacked to work" issues in my copy of http.server have been reported as bugs, on 2010-11-21. The ones of most interest related to this binary bytestream stuff are issue 10479 and issue 10480) -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: R. David said:

> From looking over the cgi code it is not clear to me whether Pierre's approach is simpler or more complex than the alternative approach of starting with binary input and decoding as appropriate. From a consistency perspective I would prefer the latter, but I don't know if I'll have time to try it out before rc1.

I say: I agree with R. David that an approach using the binary input seems more appropriate, as the HTTP byte stream is defined as binary. Do the 3.2 beta email docs now include documentation for the binary input interfaces required to code that solution? Or could you provide appropriate guidance and review, should someone endeavor to attempt such a solution? The remaining concerns below are only concerns; they may be totally irrelevant, and I'm too ignorant of how the code works to realize their irrelevance. Hopefully someone that understands the code can comment and explain. I believe that the proper solution is to make cgi work if sys.stdin has already been converted to be a binary stream, or if it hasn't, to dive down to the underlying binary stream, using detach(). Then the data should be processed as binary, and decoded once, when the proper decoding parameters are known. The default encoding seems to be different on different platforms, but the binary stream is standardized. It looks like new code was added to attempt to preprocess the MIME data into chunks to be fed to the email parser, and while I can believe code could be written to do such correctly (but I can't speak for whether this patch code is correct or not), it seems redundant/inefficient and error-prone to do it once outside the email parser, and again inside it. (I also doubt that self.fp.encoding is consistent from platform to platform.) But the HTTP bytestream is binary, and self-describing or declared by HTTP or HTML standards for the parts that are not self-describing. The default platform encoding used for the preopened sys.stdin is not particularly relevant and may introduce mojibake-type bugs, decoding errors in the presence of some inputs, and/or platform inconsistencies, and it seems that that is generally where self.fp.encoding, used in various places in this patch, comes from. Regarding the binary vs. text issue: when using both binary and text interfaces on output streams, there is the need to do flushing between text and binary writes to preserve the proper sequencing of data in the output. For input, is it possible that mixing text and binary input could result in the binary input missing data that has already been preloaded into the text buffer? Although, for CGI programs, no one should have done any text inputs before calling the CGI functions, so perhaps this is also not a concern... and there probably isn't any buffering on socket streams (the usual CGI use case), but I see the use of both binary and text input functions in this patch, so this may be another issue where someone could explain why such a mix is or isn't a problem. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: R. David said: (I believe http uses latin-1 when no charset is specified, but I need to double check that) See http://bugs.python.org/issue4953#msg121864 ASCII and UTF-8 are what HTTP defines. Some implementations may, in fact, use latin-1 instead of ASCII in some places. Not sure if we want Python CGI to do that or not. Thanks for getting the email APIs in the docs... shouldn't have to bug you as much that way :) Antoine said: (this is all funny in the light of the web-sig discussion where people explain that CGI is such a natural model) Thanks for clarifying the stdin buffering vs. binary issue... it is as I suspected. Maybe you can also explain the circumstances in which "my" Windows code is needed, and whether Python's "-u" does it automatically, but I still believe that "-u" shouldn't be necessary for a properly functioning program, not even a CGI program... it seems like a hack to allow some programs to work without other changes, so might be a useful feature, but hopefully not a required part of invoking a CGI program. The CGI interface is "self describing", when you follow the standards, and use the proper decoding for the proper pieces. In that way, it is similar to email. It is certainly not as simple as using UTF-8 everywhere, but compatibility with things invented before UTF-8 even existed somewhat prevents the simplest solution, and then not everything is text, either. At least it is documented, and permits full UNICODE data to be passed around where needed, and permits binary to be passed around where that is needed, when the specs are adhered to. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
New submission from Glenn Linderman : Per Antoine's request, I wrote this test code; it isn't elegant, I whipped it together quickly, but it shows the issue. The issue may be one of my ignorance, but it does show the behavior I described in issue 4953. Here's the output from the various test parameters that might be useful in running the test.

    >c:\python32\python.exe test.py test 1
    ['c:\\python32\\python.exe', 'test.py', '1']
    All OK

    >c:\python32\python.exe test.py test 2
    ['c:\\python32\\python.exe', 'test.py', '2']
    Not OK: b'abc\r\r\ndef\r\r\n'

    >c:\python32\python.exe test.py test 3
    ['c:\\python32\\python.exe', 'test.py', '3']
    All OK

    >c:\python32\python.exe test.py test 4
    ['c:\\python32\\python.exe', 'test.py', '4']
    Not OK: b'abc\r\r\ndef\r\r\n'

    >c:\python32\python.exe test.py test 1-u
    ['c:\\python32\\python.exe', '-u', 'test.py', '1-u']
    All OK

    >c:\python32\python.exe test.py test 2-u
    ['c:\\python32\\python.exe', '-u', 'test.py', '2-u']
    All OK

    >c:\python32\python.exe test.py test 3-u
    ['c:\\python32\\python.exe', '-u', 'test.py', '3-u']
    All OK

    >c:\python32\python.exe test.py test 4-u
    ['c:\\python32\\python.exe', '-u', 'test.py', '4-u']
    All OK

    >

Note that test 2 and 4, which do not use the msvcrt stuff, have double \r: one sent by the code, and another added, apparently by MSC newline processing. Tests 2-u and 4-u, which invoke the subprocess with Python's -u parameter, do not exhibit the problem, even though the msvcrt stuff is not used. This seems to indicate that Python's -u parameter does approximately the same thing as my windows_binary function. It seems like, if Python already has code for this, it would be nice to make it more easily available to the user as an API (like my windows_binary function, invoked with a single line) in the io or sys modules (since it is used to affect sys.std* files). And it would be nice if the function "worked cross-platform", even if it is a no-op on most platforms. -- files: test.py messages: 125500 nosy: v+python priority: normal severity: normal status: open title: binary stdio Added file: http://bugs.python.org/file20285/test.py ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
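The attached test.py is not reproduced in this archive; the following stand-alone sketch shows the kind of check described (a child writes CRLF-terminated bytes to its binary stdout layer and the parent looks for a doubled \r), assuming Python 3.2 on Windows:

    import subprocess, sys

    CHILD = "import sys; sys.stdout.buffer.write(b'abc\\r\\ndef\\r\\n')"

    def run(extra_args=()):
        args = [sys.executable] + list(extra_args) + ["-c", CHILD]
        return subprocess.Popen(args, stdout=subprocess.PIPE).communicate()[0]

    print(run())        # b'abc\r\r\ndef\r\r\n' on an affected Windows build
    print(run(["-u"]))  # b'abc\r\ndef\r\n' -- the -u workaround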
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Pierre said: In all cases the interpreter must be launched with the -u option. As stated in the documentation, the effect of this option is to "force the binary layer of the stdin, stdout and stderr streams (which is available as their buffer attribute) to be unbuffered. The text I/O layer will still be line-buffered.". On my PC (Windows XP) this is required to be able to read all the data stream; otherwise, only the beginning is read. I tried Glenn's suggestion with msvcrt, with no effect.

I say: If you start the interpreter with -u, then my msvcrt code has no effect. Without it, there is an effect. Read on...

Antoine said: Could you open a separate bug with a simple piece of code to reproduce the issue (preferably without launching an HTTP server :))?

I say: issue 10841 -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: tested on Windows, for those that aren't following issue 4953 -- components: +IO type: -> behavior ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: The same. This can be tested with the same test program:

    c:\python32\python.exe test.py 1 > test1.txt

and similarly for 2, 3, 4. Then add -u and repeat. All 8 cases produce the same results, either via a pipe, or with a redirected stdout. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: Actually, it seems like this "-u" behaviour should simply be the default for Python 3.x on Windows. The new IO subsystem seems to be able to add \r when desired anyway. And except for Notepad, most programs on Windows can deal with \r\n or solo \n anyway. \r\r\n doesn't cause too many problems for very many programs, but it is (1) non-standard, (2) wasteful of bytes, and (3) a cause of problems for CGI programs, and likely some others... I haven't done a lot of testing with that case, but I tried a few programs, and they dealt with it gracefully. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: Is there an easy way for me to find the code for -u? I haven't learned my way around the Python sources much, just peeked in modules that I've needed to fix or learn something from a little. I'm just surprised you think it is orthogonal, but I'm glad you agree it is a bug. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: I can read and understand C well enough, having coded in it for about 40 years now... but I left C for Perl and Perl for Python, I try not to code in C when I don't have to, these days, as the P languages are more productive, overall. But there has to be special handling somewhere for opening std*, because they are already open, unlike other files. That is no doubt where the bug is. Can you point me at that code? -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: I suppose the FileIO in _io is next to look at, wherever it can be found. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: Found it. The file browser doesn't tell what line number it is, but in _io/fileio.c, function fileio_init, there is code like

    #ifdef O_BINARY
        flags |= O_BINARY;
    #endif

    #ifdef O_APPEND
        if (append)
            flags |= O_APPEND;
    #endif

        if (fd >= 0) {
            if (check_fd(fd))
                goto error;
            self->fd = fd;
            self->closefd = closefd;
        }

Note that if O_BINARY is defined, it is set into the default flags for opening files by name. But if "opening" a file by fd, the fd is copied, regardless of whether it has O_BINARY set or not. The rest of the IO code no doubt assumes the file was opened in O_BINARY mode. But that isn't true of MSC std* handles by default. How -u masks or overcomes this problem is not obvious, as yet, but the root bug seems to be the assumption in the above code. A setmode of O_BINARY should be done, probably #ifdef O_BINARY, when attaching an MS C fd to a Python IO stack. Otherwise it is going to have \r\r\n problems, it would seem. Alternately, in the location where the Python IO stacks are attached to std* handles, those specific std* handles should have the setmode done there... other handles, if opened by Python, likely already have it done. Documentation for open should mention, in the description of the file parameter, that on Windows it is important to only attach a Python IO stack to O_BINARY files, or beware the consequences of two independent newline-handling algorithms being applied to the data stream... or it should document that setmode O_BINARY will be performed on the handles passed to open. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
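A user-level sketch of the behaviour being argued for, at the point where an already-open descriptor is wrapped; msvcrt.setmode() is the documented way to flip a CRT fd to binary mode, and the rest is ordinary io plumbing rather than anything from the proposed patch:

    import io, os, sys

    fd = sys.stdout.fileno()
    try:
        import msvcrt
        msvcrt.setmode(fd, os.O_BINARY)   # stop the CRT translating \n to \r\n
    except ImportError:
        pass                              # not Windows; fds have no text mode
    raw = io.FileIO(fd, "w", closefd=False)
    out = io.BufferedWriter(raw)
    out.write(b"abc\r\n")                 # reaches the pipe exactly as written
    out.flush()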
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Etienne, I'm not sure what you are _really_ referring to by HTTP_TRANSFER_ENCODING. There is a TRANSFER_ENCODING defined by HTTP but it is completely orthogonal to character encoding issues. There is a CONTENT_ENCODING defined which is a character encoding, but that is either explicit in the MIME data, or assumed to be either ASCII or UTF-8, in certain form data contexts. Because the HTTP protocol is binary, only selected data, either explicitly or implicitly (by standard definition) should be decoded, using the appropriate encoding. FieldStorage should be able to (1) read a binary stream (2) do the appropriate decoding operations (3) return the data as bytes or str as appropriate. Right now, I'm mostly interested in the fact that it doesn't do (1), so it is hard to know what it does for (2) or (3) because it gets an error first. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: I don't find "initstdio" or "stdio" in pythonrun.c. Has it moved? There are precious few references to stdin, stdout, stderr in that module, mostly for attaching the default encoding. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: stderr is notable by its absence in the list of O_BINARY adjustments. So -u does do 2/3 of what my windows_binary() does :) Should I switch my test case to use stderr to demonstrate that it doesn't help with that? I vaguely remember that early versions of DOS didn't do stderr, but I thought that by the time Windows came along (let's see, was that about 1983?) stderr was codified for DOS/Windows. For sure it has never been missing in WinNT 4.0+. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Etienne said: yes, lets not complexify anymore please... Albert Einstein said: Things should be as simple as possible, but no simpler. I say: My "learning" of HTTP predates "chunked". I've mostly heard of it being used in downloads rather than uploads, but I'm not sure if it pertains to uploads or not. Since all the data transfer is effectively chunked by TCP/IP into packets, I'm not clear on what the benefit is, but I am pretty sure it is off-topic for this bug, at least until FieldStorage works at all on 3.x, like for small pieces of data. I meant to say in my preceding response, that the multiple encodings that may be found in an HTTP stream, make it inappropriate to assign an encoding to the file through which the HTTP data streams... that explicit decode calls by FieldStorage should take place on appropriate chunks only. I almost got there, so maybe you picked it up. But I didn't quite say it. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: Makes sense to me. Still should document the open file parameter when passed an fd, and either tell the user that it should be O_BINARY, or that it will be O_BINARYd for them, whichever technique is chosen. But having two newline techniques is bad, and if Python thinks it is tracking the file pointer, but Windows is doing newline translation for it, then it isn't likely tracking it correctly for random access IO. So I think the choice should be that any fd passed in to open on Windows should get O_BINARYd immediately. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10841] binary stdio
Glenn Linderman added the comment: Victor, Thanks for your interest and patches. msg125530 points out the location of the code where _all_ fds could be O_BINARYed, when passed in to open. I think this would make all fds open in binary mode, per Guido's comment... he made exactly the comment I was hoping for, even though I didn't +nosy him... I believe this would catch std* along the way, and render your first patch unnecessary, but your second one would likely still be needed. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: We have several, myself included, that can't use CGI under 3.x because it doesn't take a binary stream. I believe there are several alternatives:

1) Document that CGI needs a binary stream, and expect the user to provide it, either an explicit handle, or by tweaking sys.stdin before calling with a default file stream.
2) Provide a CGI function for tweaking sys.stdin (along with #1).
3) Document that CGI will attempt to convert passed-in streams, default or explicit, to binary, if they aren't already, and implement the code to do so.

My choice is #3. I see CGI as being used only in HTTP environments, where the data stream should be binary anyway. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Pierre said: Option 1 is impossible, because the CGI script sometimes has no control on the stream: for instance on a shared web host, it will receive sys.stdin as a text stream.

I say: It is the user code of the CGI script that calls CGI.FieldStorage. So the user could be required (option 1) to first tweak stdin to be bytes, one way or another. I don't understand any circumstance where a Python CGI script doesn't have control over the settings of the Python IO stack that it is using to obtain the data... and the CGI spec is defined as a bytestream, so it must be able to read the bytes.

Victor said: It is possible to test the type of the stream.

I say: Yes; why just assume (as I have been) that the initial precondition is the defaults that Python imposes? Other code could have interposed something else. The user should be allowed to pass in anything that is a TextIOWrapper, or a binary stream, and CGI should be able to deal with it. If the user passes some other type, it should be assumed to produce bytes from its read() API, and if it doesn't, the user gets what he deserves (an error). Since the default Python sys.stdin is a TextIOWrapper, having CGI detect that, and extract its .buffer to use for obtaining bytes, should work fine. If the user already tweaked sys.stdin to be a binary stream (.buffer or detach()), CGI should detect and use that. If the user substitutes a different class, it should produce bytes, and that should be documented; those are the three cases that could work. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
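A sketch of the three-case detection described in this message (not the committed patch); the helper name binary_fp is made up:

    import sys
    from io import TextIOBase

    def binary_fp(fp=None):
        """Return something whose read() yields bytes."""
        fp = sys.stdin if fp is None else fp
        if isinstance(fp, TextIOBase):   # the usual default: a TextIOWrapper
            return fp.buffer             # drop down to its binary buffer
        return fp                        # already binary, or caller's responsibility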
[issue10841] binary stdio
Glenn Linderman added the comment: Thanks for your work on this Victor, and other commenters also. -- ___ Python tracker <http://bugs.python.org/issue10841> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1602] windows console doesn't print utf8 (Py30a2)
Glenn Linderman added the comment: Interesting! I was able to tweak David-Sarah's code to work with Python 3.x, mostly doing things that 2to3 would probably do: changing unicode() to str(), dropping u from u'...', etc. I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol. But I'll attach David-Sarah's code + tweaks + a test case showing output of the Cyrillic alphabet to a console with code page 437 (at least, on my Win7-64 box, that is what it is). Nice work, David-Sarah. I'm quite sure this is not in a form usable inside Python 3, but it shows exactly what could be done inside Python 3 to make things work... and gives us a workaround if Python 3 is not fixed. -- Added file: http://bugs.python.org/file20320/unicode2.py ___ Python tracker <http://bugs.python.org/issue1602> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1602] windows console doesn't print utf8 (Py30a2)
Glenn Linderman added the comment: I would certainly be delighted if someone would reopen this issue, and figure out how to translate unicode2.py to Python internals so that Python's console I/O on Windows would support Unicode "out of the box". Otherwise, I'll have to include the equivalent of unicode2.py in all my Python programs, because right now I'm including instructions for the user to (1) choose Lucida or Consolas font if they can't figure out any other font that gets rid of the square boxes, (2) chcp 65001, and (3) set PYTHONIOENCODING=UTF-8. Having this capability inside Python (or my programs) will enable me to eliminate two-thirds of the geeky instructions for my users. But it seems like a very appropriate capability to have within Python, especially Python 3.x with its preference for and support of Unicode in so many other ways. -- ___ Python tracker <http://bugs.python.org/issue1602> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10879] cgi memory usage
New submission from Glenn Linderman : In attempting to review issue 4953, I discovered a conundrum in handling of multipart/form-data. cgi.py has claimed for some time (at least since 2.4) that it "handles" file storage for uploading large files. I looked at the code in 2.6 that handles such, and it uses the rfc822.Message method, which parses headers from any object supporting readline(). In particular, it doesn't attempt to read message bodies, and there is code in cgi.py to perform that. There is still code in 3.2 cgi.py to read message bodies, but... rfc822 has gone away, and been replaced with the email package. Theoretically this is good, but the cgi FieldStorage read_multi method now parses the whole CGI input and then iteration parcels out items to FieldStorage instances. There is a significant difference here: email reads everything into memory (if I understand it correctly). That will never work to upload large or many files when combined with a web server that launches CGI programs with memory limits. I see several possible actions that could be taken:

1) Documentation. While it is doubtful that anyone is using 3.x CGI, and this makes it more doubtful, the present code does not match the documentation, because while the documentation claims to handle file uploads as files, rather than in-memory blobs, the current code does not do that.
2) If there is a method in the email package that corresponds to rfc822.Message, parsing only headers, I couldn't find it. Perhaps it is possible to feed just headers to BytesFeedParser, and stop, and get the same sort of effect. However, this is not the way cgi.py presently is coded. And if there is a better API, for parsing only headers, that is or could be exposed by email, that might be handy.
3) The 2.6 cgi.py does not claim to support nested multipart/ stuff, only one level. I'm not sure if any present or planned web browsers use nested multipart/ stuff... I guess it would require a nested <form> tag, which is illegal HTML last I checked. So perhaps the general logic flow of 2.6 cgi.py could be reinstated, with a technique to feed only headers to BytesFeedParser, together with reinstating the MIME body parsing in cgi.py, and this could make a solution that works.

I discovered this because I couldn't figure out where a bunch of the methods in cgi.py were called from, particularly read_lines_to_outerboundary, and make_file. They seemed to be called much too late in the process. It wasn't until I looked back at 2.6 code that I could see that there was a transition from using rfc822 only for headers to using email for parsing the whole data stream, and that that was the cause of the documentation not seeming to match the code logic. I have no idea if this problem is in 2.7, as I don't have it installed here for easy reference, and I'm personally much more interested in 3.2. -- components: Library (Lib) messages: 125884 nosy: r.david.murray, v+python priority: normal severity: normal status: open title: cgi memory usage versions: Python 3.1, Python 3.2, Python 3.3 ___ Python tracker <http://bugs.python.org/issue10879> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
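One possible shape for point 2, sketched with the email API that does exist (BytesFeedParser); the header-only loop and the read_part_headers name are assumptions, not cgi.py code:

    from email.parser import BytesFeedParser

    def read_part_headers(fp):
        # Feed raw lines to the parser only until the blank line that ends
        # the headers, leaving the part body unread on fp.
        parser = BytesFeedParser()
        while True:
            line = fp.readline()
            parser.feed(line)
            if line in (b"\r\n", b"\n", b""):
                break
        return parser.close()   # a Message carrying just this part's headers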
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: This looks much simpler than the previous patch. However, I think it can be further simplified. This is my first reading of this code, however, so I might be totally missing something(s).

Pierre said: Besides FieldStorage, I modified the parse() function at module level, but not parse_multipart (should it be kept at all?)

I say: Since none of this stuff works correctly in 3.x, and since there are comments in the code about "folding" the parse* functions into FieldStorage, then I think they could be deprecated, and not fixed. If people are still using them, by writing code to work around their deficiencies, that code would continue to work for 3.2, but then not in 3.3 when that code is removed? That seems reasonable to me. In this scenario, the few parse* functions that are used by FieldStorage should be copied into FieldStorage as methods (possibly private methods), and fixed there, instead of being fixed in place. That way all the parse* functions could be deprecated, and the use of them would be unchanged for 3.2. Since RFC 2616 says that the HTTP protocol uses ISO-8859-1 (latin-1), I think that should be required here, instead of deferring to fp.encoding, which would eliminate 3 lines. Also, the use of FeedParser could be replaced by BytesFeedParser, thus eliminating the need to decode header lines in that loop. And, since this patch will be applied only to Python 3.2+, the msvcrt code can be removed (you might want a personal copy with it for earlier versions of Python 3.x, of course). I wonder if the 'ascii' reference should also be 'latin-1'? In truly reading and trying to understand this code to do a review, I notice a deficiency in _parseparam and parse_header: should I file new issues for them? (Perhaps these are unimportant in practice; I haven't seen \ escapes used in HTTP headers.) RFC 2616 allows for quoted strings, which are handled in _parseparam, and for backslash-escaped characters (\c) inside quoted strings, which are handled in parse_header. But _parseparam counts " characters without concern for \", and parse_header allows for \\ and \" but not \f or \j or \ followed by other characters, even though they are permitted (but probably not needed for much). In make_file, shouldn't the encoding and newline parameters be preserved when opening text files? On the other hand, it seems like perhaps we should leverage the power of IO to do our encoding/decoding... open the file with the TextIOBase layer set to the encoding for the MIME part, but then just read binary without decoding it, write it to the .buffer of the TextIOBase, and when the end is reached, flush it, and seek(0). Then the data can be read back from the TextIOBase layer, and it will be appropriately decoded. Decoding errors might be deferred, but will still occur. This technique would save two data operations: the explicit decode in the cgi code, and the implicit encode in the IO layers, so resources would be saved. Additionally, if there is a CONTENT-LENGTH specified for non-binary data, the read_binary method should be used for it also, because it is much more efficient than readlines... less scanning of the data, and fewer outer iterations. This goes well with the technique of leaving that data in binary until read from the file. It seems that in addition to fixing this bug, you are also trying to limit the bytes read by FieldStorage to some maximum (CONTENT_LENGTH). This is good, I guess. But skip_lines() has a readline potentially as long as 32KB that isn't limited by the maximum. Similar in read_lines_to_outer_boundary, and read_lines_to_eof (although that may not get called in the cases that need to be limited). If a limit is to be checked for, I think it should be a true, exact limit, not an approximate limit. See also issue 10879. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
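The "let the IO stack do the decoding" idea from the review above, as a small sketch; spool_text_part is a made-up name and the temporary-file handling is deliberately simplified:

    import tempfile

    def spool_text_part(chunks, encoding):
        # chunks: bytes objects already read from the CGI stream for one MIME part.
        f = tempfile.TemporaryFile("w+", encoding=encoding, newline="\n")
        for chunk in chunks:
            f.buffer.write(chunk)   # write the raw bytes past the text layer
        f.flush()
        f.seek(0)
        return f                    # reading f now yields decoded str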
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Also, the required behavior of make_file changes, to need the right encoding, or binary, so that needs to be documented as a change for people porting from 2.x. It would be possible, even for files, which will be uploaded as binary, for a user to know the appropriate encoding and, if the file is to be processed rather than saved, supply that encoding for the temporary file. So the temporary file may not want to be assumed to be binary, even though we want to write binary to it. So similarly to the input stream, if it is TextIOBase, we want to write to the .buffer. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10879] cgi memory usage
Glenn Linderman added the comment: Trying to code some of this, it would be handy if BytesFeedParser.feed would return a status, indicating if it has seen the end of the headers yet. But that would only work if it is parsing as it goes, rather than just buffering, with all the real parsing work being done at .close time. -- ___ Python tracker <http://bugs.python.org/issue10879> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: I wrote: Additionally, if there is a CONTENT-LENGTH specified for non-binary data, the read_binary method should be used for it also, because it is much more efficient than readlines... less scanning of the data, and fewer outer iterations. This goes well with the technique of leaving that data in binary until read from the file. I further elucidate: Sadly, while the browser (Firefox) seems to calculate an overall CONTENT-LENGTH for the HTTP headers, it does not seem to calculate CONTENT-LENGTH for individual parts, not even file parts where it would be extremely helpful. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: It seems the choice of whether to make_file or StringIO is based on the existence of self.length... per my previous comment, content-length doesn't seem to appear in any of the multipart/ item headers, so it is unlikely that real files will be created by this code. Sadly that seems to be the case for 2.x also, so I wonder now if CGI has ever properly saved files, instead of buffering in memory... I'm basing this off the use of Firefox Live HTTP headers tool. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Victor said: Don't you think that a warning would be appropriate if sys.stdin is passed here?

    # self.fp.read() must return bytes
    if isinstance(fp, TextIOBase):
        self.fp = fp.buffer
    else:
        self.fp = fp

Maybe a DeprecationWarning if we would like to drop support of TextIOWrapper later :-)

I say: I doubt we ever want to deprecate the use of "plain stdin" as the default (or as an explicit) parameter for FieldStorage's fp parameter. Most usage of FieldStorage will want to use stdin; if FieldStorage detects that stdin is TextIOBase (generally it is) and uses its buffer to get binary data, that is very convenient for the typical CGI application. I think I agree with the rest of your comments.

Etienne said: is sendfile() available on Windows ? i thought the Apache server could use that to upload files without having to buffer files in memory..

I say: I don't think it is called that, but similar functionality may be available on Windows under another name. I don't know if Apache uses it or not. But I have no idea how FieldStorage could interact with Apache via the CGI interface to access such features. I'm unaware of any APIs Apache provides for that purpose, but if there are some, let me know. On the other hand, there are other HTTP servers besides Apache to think about. I'm also not sure if sendfile(), or an equivalent, is possible to use from within FieldStorage, because it seems in practice we don't know the size of the uploaded file without parsing it (which requires buffering it in memory to look at it). -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10879] cgi memory usage
Glenn Linderman added the comment: R. David said: However, I'm not clear on how that helps. Doesn't FieldStorage also load everything into memory?

I say: FieldStorage in 2.x (for x <= 6, at least) copies incoming file data to a file, using limited-size read/write operations. Non-file data is buffered in memory. In 3.x, FieldStorage doesn't work. The code that is there, though, for multipart/ data, would call email to do all the parsing, which would happen to include file data, which always comes in as part of a multipart/ data stream. This would prevent cgi from being used to accept large files in a limited environment. Sadly, there is code in place that would then copy the memory buffers to files, and act as if they had been buffered there all along... but process limits do not care that the memory usage is only temporary... -- ___ Python tracker <http://bugs.python.org/issue10879> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Victor said: "Set sys.stdin.buffer.encoding attribute is not a good idea. Why do you modify fp, instead of using a separated attribute on FieldStorage (eg. self.fp_encoding)?" Pierre said: I set an attribute encoding to self.fp because, for each part of a multipart/form-data, a new instance of FieldStorage is created, and this instance needs to know how to decode bytes. So, either an attribute must be set to one of the arguments of the FieldStorage constructor, and fp comes to mind, or an extra argument has to be passed to this constructor, i.e. the encoding of the original stream I say: Ah, now I understand why you did it that way, but: The RFC 2616 says the CGI stream is ISO-8859-1 (or latin-1). The _defined_ encoding of the original stream is irrelevant, in the same manner that if it is a text stream, that is irrelevant. The stream is binary, and latin-1, or it is non-standard. Hence, there is not any reason to need a parameter, just use latin-1. If non-standard streams are to be supported, I suppose that would require a parameter, but I see no need to support non-standard streams: it is hard enough to support standard streams without complicating things. The encoding provided with stdin is reasonably unlikely to be latin-1: Linux defaults to UTF-8 (at least on many distributions), and Windows to CP437, and in either case is configurable by the sysadmin. But even the sysadmin should not be expected to configure the system locale to have latin-1 as the default encoding for the system, just because one of the applications that might run is an CGI program. So I posit that the encoding on fp is irrelevan t and should be ignored, and using it as a parameter between FieldStorage instances is neither appropriate nor necessary, as the standard defines latin-1 as the encoding for the stream. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Victor said: I mean: you should pass sys.stdin.buffer instead of sys.stdin.

I say: That would be possible, but it is hard to leave it at the default, in that case, because sys.stdin will, by default, not be a binary stream. It is a convenience for FieldStorage to have a useful default for its input, since RFC 3875 declares that the message body is obtained from "standard input".

Pierre said: I wish it could be as simple, but I'm afraid it's not. On my PC, sys.stdin.encoding is cp-1252. I tested a multipart/form-data with an INPUT field, and I entered the euro character, which is encoded \x80 in cp-1252. If I use the encoding defined for sys.stdin (cp-1252) to decode the bytes received on sys.stdin.buffer, I get the correct value in the cgi script; if I set the encoding to latin-1 in FieldStorage, since \x80 maps to undefined in latin-1, I get a UnicodeEncodeError if I try to print the value ("character maps to <undefined>").

I say: Interesting. I'm curious what your system (probably Windows since you mention cp-), browser, and HTTP server are, that you used for that test. Is it possible to capture the data stream for that test? Describe how, and at what stage the data stream was captured, if you can capture it. Most interesting would be on the interface between browser and HTTP server. RFC 3875 states (section 4.1.3) what the default encodings should be, but I see that the first possibility is "system defined". On the other hand, it seems to imply that it should be a system definition specifically defined for particular media types, not just a general system definition such as might be used as a default encoding for file handles... after all, most Web communication crosses system boundaries. So lacking a system-defined definition for text/ types, it then indicates that the default for text/ types is Latin-1. I wonder what result you get with the same browser, at the web page http://rishida.net/tools/conversion/ by entering the euro symbol into the Characters entry field, and choosing convert. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
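The euro example is easy to reproduce outside any browser; this is plain codec behaviour, not anything cgi-specific:

    euro = "\u20ac"                           # the euro sign
    print(euro.encode("cp1252"))              # b'\x80'
    print(repr(b"\x80".decode("latin-1")))    # '\x80', a C1 control, not the euro
    try:
        euro.encode("latin-1")
    except UnicodeEncodeError as exc:
        print(exc)                            # latin-1 has no euro code point at all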
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: I said: I wonder what result you get with the same browser, at the web page http://rishida.net/tools/conversion/ by entering the euro symbol into the Characters entry field, and choosing convert. But I couldn't wait, so I ran a test with € in one of my input boxes, using Firefox, a FORM as: <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: R. David: Pierre said: BytesFeedParser only uses the ascii codec; if the header has non-ASCII characters (filename in a multipart/form-data), they are replaced by ?: the original file name is lost. So for the moment I leave the text version of FeedParser.

I say: Does this mean BytesFeedParser, to be useful for cgi.py, needs to accept an encoding parameter, defaulting to ASCII for the email case? Should that be a new issue? Or should cgi.py, since it can't use email to do all its work (no support for file storage, no support for encoding), simply not try, and use its own code for header decoding also? The only cost would be support for Encoded-Word -- but it is not clear that HTTP uses them? Can anyone give an example of such? Read the next message here for an example of a filename containing non-ASCII. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: In my previous message I quoted Pierre rightly cautioning about headers containing non-ASCII... and that BytesFeedParser doesn't handle them, so using it to parse headers may be questionable. So I decided to try one... I show the Live HTTP headers below, from a simple upload form. What is not so simple is the filename of the file to be uploaded... it contains a couple of non-ASCII characters... in fact, one of them is non-latin-1 also: "foöţ.html". It rather seems that Firefox provides the filename in UTF-8, although Live HTTP headers seems to have displayed it using Latin-1 on the screen! But in saving it to a file, it didn't write a BOM, and the byte sequence for the filename is definitely UTF-8, and pasted here to be viewed correctly. So my question: where does Firefox get its authority to encode the filename using UTF-8???

    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 115
    Connection: keep-alive
    Referer: http://rkivs.com.gl:8032/row/test.html
    Content-Type: multipart/form-data; boundary=---207991835220448
    Content-Length: 304

    -207991835220448
    Content-Disposition: form-data; name="submit"

    upload
    -207991835220448
    Content-Disposition: form-data; name="pre"; filename="foöţ.html"
    Content-Type: text/html

    aoheutns
    -207991835220448--

-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
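For the record, the filename in that capture really is UTF-8 on the wire, and its second accented letter has no latin-1 representation at all; this is easy to verify:

    name = "foöţ.html"
    print(name.encode("utf-8"))             # b'fo\xc3\xb6\xc5\xa3.html'
    print(ord("ö") < 256, ord("ţ") < 256)   # True False: 'ţ' is outside latin-1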
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Pierre said: Since it works the same with 2 browsers and 2 web servers, I'm almost sure it's not dependant on the configuration - but if others can test on different configurations I'd like to know the result.

I say: So I showed in my just previous messages (after the one you are responding to) my output from Live HTTP Headers, where it seems that Firefox is using UTF-8 transmission, both for header values (filename) and data values (euro character), without specifying Content-Type (for the data) or doing RFC 2047 encoding as would be expected from reading the various standard documents (RFC 2045, W3 HTML 4.01, RFC 2388). I wondered whether Live HTTP Headers was reporting the logical data, prior to encoding for transmission. But I was getting UTF-8 data inside my CGI script... So now I tweaked the server to save the bytes it transfers from its rfile to the cgi process (I had already tweaked that transfer to be binary instead of having encodings), and it is clearly UTF-8 at that point also. It looks just like the Live HTTP headers. Now that I have data capture on the server side, I can run the same tests with other browsers... so I ran it with Opera 11, IE 8, and Chrome 8, and the only differences were the specific values of the boundaries... all the data was in UTF-8, both filename and form data value. I can't now find a setting for Firefox to allow the user to control the encoding it sends to the server, but I can't rule out that I once might have, and set it to UTF-8. But I'm quite certain I don't know enough about the other browsers to adjust their settings. I don't have Apache installed on this box, so I cannot test to see if it changes something. Is there a newer standard these browsers are following, that permits UTF-8? Or even requires it? Why is Pierre seeing cp-1252, and I'm seeing UTF-8? I'm running Windows 6.1 (Build 7600), 64-bit, the so-called Windows 7 Professional edition. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Aha! Found a page <http://htmlpurifier.org/docs/enduser-utf8.html#whyutf8-support> which links to another page <http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> that explains the behavior. The synopsis is that browsers (all modern browsers) return form data Form data is generally returned in the same character encoding as the Form page itself was sent to the client. I suspect this explains the differences between what Pierre and I are reporting. I suspect (but would appreciate confirmation from Pierre), that his web pages use or else do not use such a meta tag, and his server is configured (or defaults) to send HTTP headers: Content-Type: text/html; charset=CP-1252 Whereas, I do know that all my web pages are coded in UTF-8, have no meta tags, and my CGI scripts are sending Content-Type: text/html; charset=UTF-8 for all served form pages... and thus getting back UTF-8 also, per the above explanation. What does this mean for Python support for http.server and cgi? Well, http.server, by default, sends Content-Type without charset, except for directory listings, where it supplies charset= the result of sys.getfilesystemcoding(). So it is up to META tags to define the coding, or for the browser to guess. That's probably OK: for a single machine environment, it is likely that the data files are coded in the default file system encoding, and it is likely the browser will guess that. But it quickly breaks when going to a multiple machine or internet environment with different default encodings on different machines. So if using http.server in such an environment, it is necessary to inform the client of the page encoding using META tags, or generating the Content-Type: HTTP header in the CGI script (which latter is what I'm doing for the forms and data of interest). What does it mean for cgi.py's FieldStorage? Well, use of the default encoding can work in the single machine environment... so I guess there are would be worse things that doing so, as Pierre has been doing. But clearly, that isn't the complete solution. The new parameter he proposes to FieldStorage can be used, if the application can properly determine the likeliest encoding for the form data, before calling it. On a single machine system, that could be the default, as mentioned above. On a single application web server, it could be some constant encoding used for all pages (like I use UTF-8 for all my pages). For a multiple application web server, as long as each application uses a consistent encoding, that application could properly guess the encoding to pass to FieldStorage. Or, if the application wishes to allow multiple encodings, as long as it can keep track of them, and use the right ones at the right time, it is welcome to. How does this affect email? Not at all, directly. How does this affect cgi.py's use of email? It means that cgi.py cannot use BytesFeedParser, in spite of what the standards say, so Pierre's approach of predecoding the headers is the correct one, since email doesn't offer an encoding parameter. Since email doesn't support disk storage for file uploads, but buffers everything in memory, it means that cgi.py can only pass headers to FeedParser, so has to detect end-of-headers itself, since email provides no feedback to indicate that end-of-headers was reached, and that means that cgi.py must parse the MIME parts itself, so it can put the large parts on disk. 
It means that the email package provides extremely little value to cgi.py, and since web browsers and multipart/form-data use simple subsets of the full power of RFC 822 headers, email could be replaced with the use of the existing parse_header function; that public function should perhaps be deprecated, with a copy moved inside the FieldStorage class and fixed a bit.

-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
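A minimal sketch of the arrangement described in the message above, assuming Pierre's proposed stream_encoding parameter name (the name in whatever is finally committed may differ): the script declares charset=UTF-8 for the page it serves, so the browser posts the form back in UTF-8, and the same encoding is handed to FieldStorage.

    #!/usr/bin/env python3
    # Sketch only: stream_encoding= follows the patch under discussion and is
    # an assumption; adjust to the parameter name that actually ships.
    import sys
    import cgi

    def main():
        out = sys.stdout.buffer   # emit the HTTP response as bytes
        # Declaring the charset here is what makes the browser send the form
        # data back in UTF-8, per the behavior described above.
        out.write(b"Content-Type: text/html; charset=UTF-8\r\n\r\n")

        form = cgi.FieldStorage(stream_encoding="utf-8")
        name = form.getfirst("name", "")   # a real script would HTML-escape this

        page = ('<html><head><meta http-equiv="Content-Type" '
                'content="text/html; charset=UTF-8"></head>'
                '<body><p>Hello, %s!</p></body></html>' % name)
        out.write(page.encode("utf-8"))

    if __name__ == "__main__":
        main()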
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: I notice the version on this issue is Python 3.3, but it affects 3.2 and 3.1 as well. While I would like to see it fixed for 3.2, perhaps it is too late for that, with rc1 coming up this weekend? Could the parse functions that are not yet deprecated at least be deprecated in 3.2, so that they could be removed in 3.3? Or should we continue to support them? -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Pierre, I applied your patch to my local copy of cgi.py for my installation of 3.2, and have been testing. Lots of things work great!

My earlier comment regarding make_file seems to be relevant. Files that are not binary should have an encoding. Likely you removed the encoding because it was a hard-coded UTF-8 and that didn't work for you, with your default encoding of cp1252. However, now that I am passing in UTF-8 via the stream_encoding parameter, because that is what matches my form data, I get an error that cp1252 (apparently also my default encoding, except for console stuff, which is code page 437) cannot encode \u0163. So I think the encoding parameter should be added back in, but the value used should be the stream_encoding parameter. You might also turn around the test on self.filename:

    import tempfile

    if self.filename:
        return tempfile.TemporaryFile("wb+")
    else:
        return tempfile.TemporaryFile("w+",
                                      encoding=self.stream_encoding,
                                      newline="\n")

One of my tests used a large textarea and a short file. I was surprised to see that the file was not stored as a file, but the textarea was. I guess that is due to the code in read_single that checks length rather than filename to decide whether it should be stored in a file from the get-go. It seems that this behaviour, while probably more efficient than actually creating a file, might be surprising to folks overriding make_file so that they could directly store the data in the final destination file, instead of copying it later. The documented semantics for make_file do not state that it is only called if there are more than 1000 bytes of data, or if the form-data item's headers contain a Content-Length header (which never seems to happen). Indeed, I found a comment on StackOverflow where someone had been surprised that small files did not have make_file called on them.

-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
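A rough sketch of the sort of make_file override mentioned above, assuming the patch's self.filename and self.stream_encoding attributes and a no-argument make_file; note that, per the behavior just described, small parts may be buffered in memory and never reach make_file at all.

    import os
    import tempfile
    import cgi

    UPLOAD_DIR = "/var/tmp/uploads"   # hypothetical destination directory

    class SavingFieldStorage(cgi.FieldStorage):
        def make_file(self):
            if self.filename:
                # File upload: create the destination file directly, in binary
                # mode, instead of an anonymous temporary file that would have
                # to be copied later.
                path = os.path.join(UPLOAD_DIR, os.path.basename(self.filename))
                return open(path, "wb+")
            # Ordinary field: keep the temporary-file behavior, using the same
            # encoding as the rest of the form data (attribute name assumed).
            return tempfile.TemporaryFile("w+",
                                          encoding=self.stream_encoding,
                                          newline="\n")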
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: I'd be willing to propose such a patch and tests, but I haven't a clue how, other than starting by reading the contributor document... I was putting off learning the process until the hg conversion, not wanting to learn an old process for a few months :( And I've never written an official Python test, or learned how to use the test modules, etc. So that's a pretty steep curve for the 2 days remaining.

Due to the way that browsers actually work, vs. how the standards are written, it seems necessary to add the optional stream_encoding parameter. The limit parameter Pierre is proposing is also a good check against improperly formed inputs. So there are new, optional parameters to the FieldStorage constructor. Without these fixes, though, cgi.py continues to be totally useless for file uploads, so not releasing this in 3.2 makes 3.2 continue to be useless as a basis for web applications. I have no idea if there is a timeframe for 3.3, nor what it is. I'm not sure if, or how many, web frameworks use cgi.py vs. replacing the functionality. It seems at least some replace it, so they may not suffer in porting to 3.x (except internally, grappling with the same issues).

Happily, Pierre's latest patch needs only one more fix, per my (non-Python-standard) testing. Between his testing in one environment using default code pages, and mine using UTF-8, the bases seem to be pretty well covered for testing... certainly more than the previous default tests. I think you contributed some tests; I haven't tried them, but it seems Pierre has, as he has a patch for that also (which I haven't tried).

-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Pierre said: The encoding used by the browser is defined in the Content-Type meta tag, or the content-type header; if not, the default seems to vary for different browsers. So it's definitely better to define it. The argument stream_encoding used in FieldStorage *must* be this encoding.

I say: I agree it is better to define it. I think you just said the same thing that the page I linked to said; I might not have conveyed that correctly in my paraphrasing. I assume you are talking about the charset of the Content-Type of the form page itself, as served to the browser, since the browser, sadly, doesn't send that charset back with the form data.

Pierre says: But this raises another problem, when the CGI script has to print the data received. The built-in print() function encodes the string with sys.stdout.encoding, and this will fail if the string can't be encoded with it. It is the case on my PC, where sys.stdout.encoding is cp1252: it can't handle Arabic or Chinese characters.

I say: I don't think there is any need to override print, especially not builtins.print. It is still true that the HTTP data stream is and should be treated as a binary stream. So the script author is responsible for creating such a binary stream. The FieldStorage class does not use the print method, so it seems inappropriate to add a parameter to its constructor to create a print method that it doesn't use.

For the convenience of CGI script authors, it would be nice if cgi provided access to the output stream in a useful way... and I agree that, because the generation of an output page comes complete with its own encoding, the output stream encoding parameter should be separate from the stream_encoding parameter required for FieldStorage. A separate, new function or class for doing that seems appropriate, possibly included in cgi.py, but not in FieldStorage. Message 125100 in this issue describes a class IOMix that I wrote and use for such; codifying it by including it in cgi.py would be fine by me... I've been using it quite successfully for some months now. The last line of message 125100 may be true; perhaps a few more methods should be added. However, print is not one of them. I think you'll be pleasantly surprised to discover (as I was, after writing that line) that builtins.print converts its parameters to str and writes to stdout, assuming that stdout will do the appropriate encoding. The class IOMix will, in fact, do that appropriate encoding (given an appropriate parameter to its initialization). Perhaps for CGI, a convenience function could be added to IOMix to include the last two code lines after IOMix in the prior message:

    @staticmethod
    def setup(encoding="UTF-8"):
        sys.stdout = IOMix(sys.stdout, encoding)
        sys.stderr = IOMix(sys.stderr, encoding)

Note that IOMix allows the user's choice of output stream encoding, applies it to both stdout and stderr, which both need it, and also allows the user to generate binary directly (if sending back a file, for example), as both bytes and str are accepted. In 3.x, print can be used with a file= parameter, which your implementation doesn't permit and which a CGI script could use to write to other files, so I really, really don't think we want to override builtins.print without the file= parameter, specifically tying it to stdout. My message 126075 still needs to be included in your next patch.
-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
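For readers without access to msg125100, a rough reconstruction of what an IOMix-style wrapper could look like, based only on the description in this thread (the real class may differ): it accepts both str and bytes, encodes str with the chosen encoding, and pushes everything through the underlying binary stream so no separate flushing is needed.

    import sys

    class IOMixSketch:
        """Sketch of a stdout/stderr wrapper that accepts both str and bytes."""

        def __init__(self, stream, encoding="UTF-8"):
            self._raw = stream.buffer      # underlying binary stream
            self.encoding = encoding

        def write(self, data):
            if isinstance(data, bytes):
                self._raw.write(data)      # binary chunks pass straight through
            else:
                self._raw.write(data.encode(self.encoding))

        def flush(self):
            self._raw.flush()

        @staticmethod
        def setup(encoding="UTF-8"):
            # The convenience function suggested above: fix up both streams.
            sys.stdout = IOMixSketch(sys.stdout, encoding)
            sys.stderr = IOMixSketch(sys.stderr, encoding)

After IOMixSketch.setup("UTF-8"), print() continues to work for str output and sys.stdout.write(b"...") can be used for raw bytes, in either order.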
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.0
Glenn Linderman added the comment: Pierre, Looking better. I see you've retained the charset parameter, but do not pass it through to nested calls of FieldStorage. This is good, because it wouldn't work if you did. However, purists might still complain that FieldStorage should only ever use and affect stdin... but since I'm a pragmatist, I'll note that the default charset value is None, which means it does nothing to stdout or stderr by default, and be content with that. I've run a couple of basic tests and it works, and the code for the other things hasn't changed since your last iteration, but I'll test them again after I get some sleep. I'll try setting the Version here back to 3.2 -- it is a bug in 3.2 -- and see if some committer will take pity on web developers who use CGI, and are hoping to be able to use Python 3.2 someday. -- versions: +Python 3.2 ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.x
Glenn Linderman added the comment: The O_BINARY stuff was probably necessary because issue 10841 is not yet in the build Pierre was using? I agree it is not necessary with the fix for that issue, but neither does it hurt. It could be stripped out, if you think that is best, Antoine. But there is a working patch. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.x
Glenn Linderman added the comment: Victor, thanks for your comments, and interest in this bug. Other than the existence of the charset parameter, and whether or not to include IOMix, I think all of the other points could be fixed later, and do not hurt at present. So I will just comment on those two.

I would prefer to see FieldStorage not have the charset attribute either, but I don't have the practice to produce an alternate patch, and I can see that it would be a convenience for some CGI scripts to specify that parameter, and have one API call do all the work necessary to adjust the IO streams and read all the parameters, so that the rest of the logic of the web app can follow. Personally, I adjust the stdout/stderr streams earlier in my scripts, and only optionally call FieldStorage, if I determine the request needs it.

I've been using IOMix for some months (I have a version for both Python 2 and 3), and it solves a real problem in generating web page data streams... the data stream should be bytes, but a lot of the data is manipulated as str, which then needs to be encoded. The default encoding of stdout is usually wrong, so it must somehow be changed. And when you have chunks of bytes (in my experience usually from a database or file) to copy to the output stream, if your prior write was str and you then write bytes to sys.stdout.buffer, you also have to remember to flush the text layer (the TextIOWrapper) first. IOMix provides a convenient solution to all these problems, doing the flushing for you automatically, and just taking what comes and doing the right thing. If I hadn't already invented IOMix to help write web pages, I would want to :)

-- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
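A small illustration of the flushing pitfall described above, using only plain sys.stdout and sys.stdout.buffer (no wrapper): the text layer buffers independently of the binary layer, so it has to be flushed before raw bytes are written or the output can come out in the wrong order.

    import sys

    sys.stdout.write("Content-Type: application/octet-stream\r\n\r\n")
    sys.stdout.flush()                        # flush the text layer first...
    with open("logo.png", "rb") as f:         # hypothetical file being served
        sys.stdout.buffer.write(f.read())     # ...then write the raw bytes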
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.x
Glenn Linderman added the comment: Graham, Thanks for your comments. Fortunately, if the new charset parameter is not supplied, no mucking with stdout or stderr is done, which is the only reason I cannot argue strongly against the feature, which I would have implemented as a separate API... it doesn't get in the way if you don't use it. I would be happy to see the argv code removed, but it has been there longer than I have been a Python user, so I just live with it... and don't pass arguments to my CGI scripts anyway. I've assumed that it is some sort of a debug feature, but I also saw some code in CGIHTTPServer and http.server that apparently, on some platforms, actually does pass parameters to CGI scripts on the command line. I would be happy to see that code removed too, but it also predates my Python experience. And no sign of "if debug:" in either of them! -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.x
Glenn Linderman added the comment: Pierre, Thank you for the new patch, with the philosophy of "it's broke, so let's produce something the committers like to get it fixed". I see you overlooked removing the second use of O_BINARY. Locally, I removed that also, and tested your newest patch, and it still functions great for me. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.x
Changes by Glenn Linderman : -- versions: +Python 3.2 -Python 3.3 ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4953] cgi module cannot handle POST with multipart/form-data in 3.x
Glenn Linderman added the comment: Thanks to Pierre for producing patch after patch and testing, testing, testing, and to Victor for committing it, as well as to others who contributed in smaller ways, as I tried to. I look forward to 3.2 rc1 so I can discard all my temporary patched copies of cgi.py. -- ___ Python tracker <http://bugs.python.org/issue4953> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1602] windows console doesn't print or input Unicode
Glenn Linderman added the comment: Victor said: Why do you set the code page to 65001? In all my tests (on Windows XP), it always break the standard input.

My response: Because when I searched Windows for Unicode and/or UTF-8 stuff, I found 65001, and it seemed like it might help, and it did a bit. And then I found PYTHONIOENCODING, and that helped some. And that got me something that worked enough better than what I had before, so I quit searching. You did a better job of analyzing and testing all the cases. I will have to go subtract the 65001 part, and confirm your results; maybe it is useless now that other pieces of the puzzle are in place. Certainly with David-Sarah's code it seems not to be needed. Whether it was a necessary part of the previous workaround I am not sure, because of the limited number of cases I tried (trying to find something that worked well enough, but not having enough knowledge to find David-Sarah's solution, nor a good enough testing methodology to try the pieces independently). Thank you for your interest in this issue.

-- ___ Python tracker <http://bugs.python.org/issue1602> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
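For reference, a tiny check of what these knobs influence, under the assumption of a Windows console session: PYTHONIOENCODING overrides the encoding Python picks for the standard streams, while "chcp 65001" only changes the console code page that would otherwise be picked up.

    import sys

    # Run once normally and once with the environment variable set, e.g.
    #   set PYTHONIOENCODING=utf-8
    # in the same cmd session, before launching Python.
    print("stdin: ", sys.stdin.encoding)    # e.g. cp437 on a default US console
    print("stdout:", sys.stdout.encoding)   # reports utf-8 when overridden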
[issue10879] cgi memory usage
Glenn Linderman added the comment: Issue 4953 has somewhat resolved this issue by using email only for parsing headers (more like 2.x did). So this issue could be closed, or could be left open to point out the additional features needed from email before cgi.py can use it for handling body parts as well as headers. -- ___ Python tracker <http://bugs.python.org/issue10879> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10479] cgitb.py should assume a binary stream for output
Glenn Linderman added the comment: So since cgi.py was fixed to use the .buffer attribute of sys.stdout, that leaves sys.stdout itself as a character stream, and cgitb.py can successfully write to that. If cgitb.py never writes anything but ASCII, then maybe that should be documented, and this issue closed. If cgitb.py writes non-ASCII, then it should use an appropriate encoding for the web application, which isn't necessarily the default encoding on the system. Some user control over the appropriate encoding should be given, or it should be documented that the encoding of sys.stdout should be changed to an appropriate encoding, because that is where cgitb.py will write its character stream. Guidance on how to do that would be appropriate for the documentation also, as a CGI application may be the first program a programmer writes that can't just use the default encoding configured for the system. -- ___ Python tracker <http://bugs.python.org/issue10479> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
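One way the suggested guidance could read, sketched under the assumption that cgitb picks up sys.stdout at the time enable() is called: rebind sys.stdout to a text stream using the encoding the application actually serves, then enable cgitb.

    import io
    import sys
    import cgitb

    # Rebind stdout to the encoding the web application serves its pages in
    # (UTF-8 here, as an example), instead of the system default, so cgitb's
    # HTML traceback report is emitted with that encoding.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8",
                                  errors="replace", newline="\n")
    cgitb.enable()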
[issue10480] cgi.py should document the need for binary stdin/stdout
Glenn Linderman added the comment: Fixed by issue 10841 and issue 4953. -- status: open -> closed ___ Python tracker <http://bugs.python.org/issue10480> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com