Re: Whittle it on down
On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
>
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
>
> I don't even care about faster: Its overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.
You're probably right, but I find it easier to reason about matching in
Python rather than the overly terse, cryptic regular expression mini-
language.
I haven't tested my function version, but I'm 95% sure that it is correct.
The trickiest part of it is the logic about splitting around ampersands. And
I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, or
at signs". But your regex solution:
r"^[A-Z\s&]+$"
is much smaller and more compact, but *wrong*. For instance, your regex
wrongly accepts both "&" and " " as valid strings, and wrongly
rejects "ΔΣΘΛ". Your Greek customers will be sad...
Oh, I just realised, I should have looked more closely at the examples
given, because the specification given by DFS does not match the examples.
DFS says that only uppercase letters and ampersands are allowed, but their
examples include strings with spaces, e.g. 'FITNESS CENTERS' despite the
lack of ampersands. (I read the spec literally as spaces only allowed if
they surround an ampersand.) Oops, mea culpa. That makes the check function
much simpler and easier to extend:
def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()
and now I'm 95% confident it is correct without testing, this time for sure!
;-)
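For the record, that simplified version passes a quick sanity run (the test values below are my own, not DFS's data):

```python
def check(string):
    # Remove ampersands and spaces, then require the remainder to be
    # entirely uppercase letters.
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

for good in ("FITNESS CENTERS", "AAA & BBB", "ΔΣΘΛ"):
    assert check(good)
for bad in ("", "&", " ", "Fitness Centers", "AB12"):
    assert not check(bad)
```

Note that this version accepts spaces anywhere, not just around ampersands, matching the examples rather than the literal spec.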
--
Steve
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, May 5, 2016, at 12:04 AM, Steven D'Aprano wrote:
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>
> > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> >> Start by writing a function or a regex that will distinguish strings that
> >> match your conditions from those that don't. A regex might be faster, but
> >> here's a function version.
> >> ... snip ...
> >
> > Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> > powerful string type can answer the problem more clearly, but this seems
> > to go out of its way to do otherwise.
> >
> > I don't even care about faster: Its overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
>
> You're probably right, but I find it easier to reason about matching in
> Python rather than the overly terse, cryptic regular expression mini-
> language.
>
> I haven't tested my function version, but I'm 95% sure that it is
> correct. The trickiest part of it is the logic about splitting around
> ampersands. And I'll cheerfully admit that it isn't easy to extend to
> (say) "ampersand, or at signs". But your regex solution:
>
> r"^[A-Z\s&]+$"
>
> is much smaller and more compact, but *wrong*. For instance, your regex
> wrongly accepts both "&" and " " as valid strings, and wrongly
> rejects "ΔΣΘΛ". Your Greek customers will be sad...

Meh. You have a pedantic definition of wrong. Given the inputs, it
produced the right output. Very often that's enough. Perfect is the
enemy of good, it's said.

There's no situation where "&" and " " will exist in the given dataset,
and recognizing that is important. You don't have to account for every
bit of nonsense.

If the OP needs a Unicode-aware solution, redefine "A-Z" as perhaps "\w"
with an isupper call. It's still far simpler than what you're
suggesting.

-- 
Stephen Hansen
m e @ i x o k a i . i o
Re: Whittle it on down
Oh, a further thought...
On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
>
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
>
> I don't even care about faster: Its overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.
Putting non-ASCII letters aside for the moment, how would you match these
specs as a regular expression?
- All uppercase ASCII letters (A to Z only), optionally separated into words
by either a bare ampersand (e.g. "AAA&AAA") or an ampersand with leading and
trailing spaces (spaces only, not arbitrary whitespace): "AAA & AAA".
- The number of spaces on either side of the ampersands need not be the
same: "AAA& BBB & CCC" should match.
- Leading or trailing spaces, or spaces not surrounding an ampersand, must
not match: "AAA BBB" must be rejected.
- Leading or trailing ampersands must also be rejected. This includes the
case where the string is nothing but ampersands.
- Consecutive ampersands "AAA&&&BBB" and the empty string must be rejected.
I get something like this:
r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
but it fails on strings like "AA & A & A". What am I doing wrong?
For the record, here's my brief test suite:
def test(pat):
    for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
        assert re.match(pat, s) is None
    for s in ("A", "A & A", "AA&A", "AA & A & A"):
        assert re.match(pat, s)
--
Steve
Re: Whittle it on down
Steven D'Aprano wrote:
> Oh, a further thought...
>
>
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>
>> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>>> Start by writing a function or a regex that will distinguish strings
>>> that match your conditions from those that don't. A regex might be
>>> faster, but here's a function version.
>>> ... snip ...
>>
>> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
>> powerful string type can answer the problem more clearly, but this seems
>> to go out of its way to do otherwise.
>>
>> I don't even care about faster: Its overly complicated. Sometimes a
>> regular expression really is the clearest way to solve a problem.
>
> Putting non-ASCII letters aside for the moment, how would you match these
> specs as a regular expression?
>
> - All uppercase ASCII letters (A to Z only), optionally separated into
> words by either a bare ampersand (e.g. "AAA&AAA") or an ampersand with
> leading and
> trailing spaces (spaces only, not arbitrary whitespace): "AAA & AAA".
>
> - The number of spaces on either side of the ampersands need not be the
> same: "AAA& BBB & CCC" should match.
>
> - Leading or trailing spaces, or spaces not surrounding an ampersand, must
> not match: "AAA BBB" must be rejected.
>
> - Leading or trailing ampersands must also be rejected. This includes the
> case where the string is nothing but ampersands.
>
> - Consecutive ampersands "AAA&&&BBB" and the empty string must be
> rejected.
>
>
> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
>
> but it fails on strings like "AA & A & A". What am I doing wrong?
>
>
> For the record, here's my brief test suite:
>
>
> def test(pat):
>     for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
>         assert re.match(pat, s) is None
>     for s in ("A", "A & A", "AA&A", "AA & A & A"):
>         assert re.match(pat, s)

>>> def test(pat):
...     for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
...         assert re.match(pat, s) is None
...     for s in ("A", "A & A", "AA&A", "AA & A & A"):
...         assert re.match(pat, s)
...
>>> test("^A+( *& *A+)*$")
>>>
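Peter's pattern uses a literal "A" to stand for any run of letters; substituting the character class (the [A-Z] generalisation is my own, not Peter's posting) passes the same test suite:

```python
import re

# Peter Otten's structure, with [A-Z]+ in place of the literal A+.
pat = r"^[A-Z]+( *& *[A-Z]+)*$"

for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
    assert re.match(pat, s) is None
for s in ("A", "A & A", "AA&A", "AA & A & A", "FOO & BAR"):
    assert re.match(pat, s)
```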
Re: Whittle it on down
On Thursday 05 May 2016 17:34, Stephen Hansen wrote:

> Meh. You have a pedantic definition of wrong. Given the inputs, it
> produced right output. Very often that's enough. Perfect is the enemy of
> good, it's said.

And this is a *perfect* example of why we have things like this:

http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems

"Nobody will ever be called Null."
"Nobody has quotation marks in their name."
"Nobody will have a + sign in their email address."
"Nobody has a legal gender other than Male or Female."
"Nobody will lean on the keyboard and enter gobbledygook into our form."
"Nobody will try to write more data than the space they allocated for it."

> There's no situation where "&" and " " will exist in the given
> dataset, and recognizing that is important. You don't have to account
> for every bit of nonsense.

Whenever a programmer says "This case will never happen", ten thousand
computers crash.

http://www.kr41.net/2016/05-03-shit_driven_development.html

-- 
Steven D'Aprano
Re: No SQLite newsgroup, so I'll ask here about SQLite, python and MS Access
There's a gmane 'newsgroup from a mailing list' for sqlite:

    gmane.comp.db.sqlite.general

It's quite active and helpful too. (Also 'announce' and others)

-- 
Chris Green
smtplib not working when python run under windows service via Local System account
I have a python 2.7.10 script which is being run under a windows service
on windows 2012 server. The python script uses smtplib to send an email.
It works fine when the windows service is run as a local user, but not
when the windows service is configured to run as Local System account. I
get no exception from smtplib, but the email fails to arrive. Any ideas?
Re: Interacting with Subprocesses
On Wed, May 4, 2016 at 4:04 PM, Akira Li <[email protected]> wrote:
>
> Pass stdin=PIPE, stdout=PIPE and use p.stdin, p.stdout file objects to
> write input, read output from the child process.
>
> Beware, there could be buffering issues or the child process may change
> its behavior some other way when the standard input/output streams are
> redirected. See
> http://pexpect.readthedocs.io/en/stable/FAQ.html#whynotpipe

On Linux, you may be able to use stdbuf [1] to modify standard I/O
buffering. stdbuf sets the LD_PRELOAD [2] environment variable to load
libstdbuf.so [3]. For example, the following shows the environment
variables created by "stdbuf -oL":

    $ stdbuf -oL python -c 'import os;print os.environ["LD_PRELOAD"]'
    /usr/lib/coreutils/libstdbuf.so

    $ stdbuf -oL python -c 'import os;print os.environ["_STDBUF_O"]'
    L

[1]: http://www.gnu.org/software/coreutils/manual/html_node/stdbuf-invocation.html
[2]: http://www.linuxjournal.com/article/7795
[3]: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/libstdbuf.c?id=v8.21

On Windows, if you can modify the program, then you can check for a
command-line option or an environment variable, like Python's -u and
PYTHONUNBUFFERED. If you can't modify the source, I think you might be
able to hack something similar to the Linux LD_PRELOAD environment
variable by creating a stdbuf.exe launcher that debugs the process and
injects a DLL after the loader's first-chance breakpoint. The injected
stdbuf.dll would need to be able to get the standard streams for common
CRTs, such as by calling __acrt_iob_func for ucrtbase.dll. Also, unlike
the Linux command, stdbuf.exe would have to wait on the child, since the
Windows API doesn't have fork/exec.
Re: Whittle it on down
On 5/5/2016 2:04 AM, Steven D'Aprano wrote:
On Thursday 05 May 2016 14:58, DFS wrote:
Want to whittle a list like this:
[...]
Want to keep all elements containing only upper case letters or upper
case letters and ampersand (where ampersand is surrounded by spaces)
Start by writing a function or a regex that will distinguish strings that
match your conditions from those that don't. A regex might be faster, but
here's a function version.
def isupperalpha(string):
    return string.isalpha() and string.isupper()
def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    # Don't strip leading spaces from the start of the string.
    parts[0] = parts[0].rstrip(" ")
    # Or trailing spaces from the end of the string.
    parts[-1] = parts[-1].lstrip(" ")
    # But strip leading and trailing spaces from the middle parts
    # (if any).
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)
Now you have two ways of filtering this. The obvious way is to extract
elements which meet the condition. Here are two ways:
# List comprehension.
newlist = [item for item in oldlist if check(item)]
# Filter, Python 2 version
newlist = filter(check, oldlist)
# Filter, Python 3 version
newlist = list(filter(check, oldlist))
In practice, this is the best (fastest, simplest) way. But if you fear that
you will run out of memory dealing with absolutely humongous lists with
hundreds of millions or billions of strings, you can remove items in place:
def remove(func, alist):
    for i in range(len(alist)-1, -1, -1):
        if not func(alist[i]):
            del alist[i]
Note the magic incantation to iterate from the end of the list towards the
front. If you do it the other way, Bad Things happen. Note that this will
use less memory than extracting the items, but it will be much slower.
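A quick demonstration of that incantation (the toy data is my own): deleting from the end means each deletion only shifts indices that have already been visited, so nothing is skipped.

```python
def remove(func, alist):
    # Iterate from the end so each deletion only shifts items
    # we have already visited.
    for i in range(len(alist) - 1, -1, -1):
        if not func(alist[i]):
            del alist[i]

data = ["GOOD", "bad", "ALSO GOOD", "worse"]
remove(str.isupper, data)
print(data)  # -> ['GOOD', 'ALSO GOOD']
```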
You can combine the best of both worlds. Here is a version that uses a
temporary list to modify the original in place:
# works in both Python 2 and 3
def remove(func, alist):
    # Modify list in place, the fast way.
    alist[:] = filter(func, alist)
You are out of your mind.
Re: Whittle it on down
On 5/5/2016 1:39 AM, Stephen Hansen wrote:

> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in list if pattern.match(x)]

Holy Shr"^[A-Z\s&]+$"

One line of parsing! I was figuring a few list comprehensions would do
it - this is better.

(note: the reason I specified 'spaces around ampersand' is so it would
remove 'Q&A' if that ever came up - but some people write 'Q & A', so
I'll live with that exception, or try to tweak it myself.)

You're the man, man. Thank you!
Re: Whittle it on down
On 5/5/2016 1:53 AM, Jussi Piitulainen wrote:

> Either way is easy to approximate with a regex:
>
>     import re
>     upper = re.compile(r'[A-Z &]+')
>     lower = re.compile(r'[^A-Z &]')
>     print([datum for datum in data if upper.fullmatch(datum)])
>     print([datum for datum in data if not lower.search(datum)])
>
> This is similar to Hansen's solution. I've skipped testing that the
> ampersand is between spaces, and I've skipped the period. Adjust.

Will do.

> This considers only ASCII upper case letters. You can add individual
> letters that matter to you, or you can reach for the documentation to
> find if there is some generic notation for all upper case letters. The
> newer regex package on PyPI supports POSIX character classes like
> [:upper:], I think, and there may or may not be notation for Unicode
> character categories in re or regex - LU would be Letter, Uppercase.

Thanks.
Re: Whittle it on down
On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
> > There's no situation where "&" and " " will exist in the given
> > dataset, and recognizing that is important. You don't have to account
> > for every bit of nonsense.
>
> Whenever a programmer says "This case will never happen", ten thousand
> computers crash.

What crash can including such an entry in the output list cause? Should
the regex also ensure that the data only includes *english words*
separated by space-ampersand-space?
Re: Whittle it on down
On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
> Putting non-ASCII letters aside for the moment, how would you match these
> specs as a regular expression?

Well, obviously *your* language (not the OP's), given the cases you
reject, is "one or more sequences of letters separated by
space*-ampersand-space*", and that is actually one of the easiest kinds
of regex to write: "[A-Z]+( *& *[A-Z]+)*".

However, your spec is wrong:

> - Leading or trailing spaces, or spaces not surrounding an ampersand,
> must not match: "AAA BBB" must be rejected.

The *very first* item in the OP's list of good outputs is 'PHYSICAL
FITNESS CONSULTANTS & TRAINERS'.

If you want something that's extremely conservative (except for the
*very odd in context* choice of allowing arbitrary numbers of spaces -
why would you allow this but reject leading or trailing space?) and
accepts all of the OP's input:

    [A-Z]+(( *& *| +)[A-Z]+)*
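A quick check of that conservative pattern against a few of the OP's examples (the extra test strings are my own choices). Since it has no anchors, it needs fullmatch:

```python
import re

# Random832's conservative pattern: letter runs joined by either an
# ampersand (with optional surrounding spaces) or plain spaces.
pat = re.compile(r"[A-Z]+(( *& *| +)[A-Z]+)*")

assert pat.fullmatch("PHYSICAL FITNESS CONSULTANTS & TRAINERS")
assert pat.fullmatch("AAA & BBB")
assert pat.fullmatch("Q&A")
assert pat.fullmatch("AAA BBB")  # spaces without an ampersand accepted here
assert pat.fullmatch(" AAA") is None   # leading space rejected
assert pat.fullmatch("AAA&") is None   # trailing ampersand rejected
assert pat.fullmatch("") is None
```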
Re: Whittle it on down
On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> Oh, a further thought...
>
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > I don't even care about faster: Its overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
>
> Putting non-ASCII letters aside for the moment, how would you match these
> specs as a regular expression?

I don't know, but mostly because I wouldn't even try. The requirements
are over-specified. If you look at the OP's data (and based on previous
conversation), he's doing web scraping and trying to pull out good data.
There's no absolutely perfect way to do that because the system he's
scraping isn't meant for data processing. The data isn't cleanly
articulated. Instead, he wants a heuristic to pull out what look like
section titles.

The OP looked at the data and came up with a simple set of rules that
identify these section titles:

>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)

This translates naturally into a simple regular expression: an uppercase
string with spaces and &'s. Now, that expression doesn't 100% encode
every detail of that rule -- it allows both Q&A and Q & A -- but on my
own reading of the data, I suspect it's good enough. The titles are
clearly separate from the other data scraped by their being upper cased.
We just need to expand our allowed character range to include spaces
and &'s.

Nothing in the OP's request demands the kind of rigorous matching that
your scenario does. It's a practical problem with a simple, practical
answer.

-- 
Stephen Hansen
m e @ i x o k a i . i o
Re: Whittle it on down
On 5/5/2016 9:32 AM, Stephen Hansen wrote:
> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>>
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>>> I don't even care about faster: Its overly complicated. Sometimes a
>>> regular expression really is the clearest way to solve a problem.
>>
>> Putting non-ASCII letters aside for the moment, how would you match
>> these specs as a regular expression?
>
> I don't know, but mostly because I wouldn't even try. The requirements
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.
> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated. Instead, he wants a heuristic to pull out what look like
> section titles.

Assigned by a company named localeze, apparently.
http://www.usdirectory.com/cat/g0
https://www.neustarlocaleze.biz/welcome/

> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
>
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)
>
> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> looking at the data, I suspect its good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
>
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. Its a practical problem with a simple, practical
> answer.

Yes. And simplicity + practicality = successfulality.

And I do a sanity check before using the data anyway: after parse and
cleanup and regex matching, I make sure all lists have the same number
of elements:

    lenData = [len(title), len(names), len(addr), len(street),
               len(city), len(state), len(zip)]
    if len(set(lenData)) != 1:
        alert the media
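That check is easy to make runnable; here is a sketch with hypothetical stand-in lists (in the real script these come from the scrape):

```python
# Hypothetical scraped columns, standing in for the real parsed data.
title = ["FITNESS CENTERS"]
names = ["Acme Gym"]
addr = ["123 Main St"]

lenData = [len(title), len(names), len(addr)]
if len(set(lenData)) != 1:
    # "alert the media": here we just raise.
    raise ValueError("scraped lists are out of sync: %r" % lenData)
```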
Re: Whittle it on down
On Thu, 5 May 2016 06:17 pm, Peter Otten wrote:
>> I get something like this:
>>
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>
>>
>> but it fails on strings like "AA & A & A". What am I doing wrong?
> test("^A+( *& *A+)*$")
Thanks Peter, that's nice!
--
Steven
Ctypes c_void_p overflow
I have a CDLL function I use to get a pointer; several other functions
happily accept this pointer, which is really a long when passed to
ctypes.c_void_p. However, only one with the same type def in the
prototype overflows. The docs suggest c_void_p takes an int, but that is
not what the first call returns, nor what all but one function happily
accept?

Anyone familiar enough with ctypes that can shed some light?

Thanks,
jlc
Re: Whittle it on down
On Thu, 5 May 2016 11:13 pm, Random832 wrote:

> On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
>>> There's no situation where "&" and " " will exist in the given
>>> dataset, and recognizing that is important. You don't have to account
>>> for every bit of nonsense.
>>
>> Whenever a programmer says "This case will never happen", ten thousand
>> computers crash.
>
> What crash can including such an entry in the output list cause?

How do I know? It depends what you do with that list. But if you assume
that your list contains alphabetical strings, and pass it on to code
that expects alphabetical strings, why is it so hard to believe that it
might choke when it receives a non-alphabetical string?

> Should the regex also ensure that the data only includes *english words*
> separated by space-ampersand-space?

That wasn't part of the specification. But for some applications, yes,
you should ensure the data includes only English words.

-- 
Steven
Re: Ctypes c_void_p overflow
On Fri, 6 May 2016 01:42 am, Joseph L. Casale wrote:

> I have CDLL function I use to get a pointer, several other functions
> happily accept this pointer which is really a long when passed to
> ctypes.c_void_p. However, only one with same type def in the prototype
> overflows. Docs suggest c_void_p takes an int but that is not what the
> first call returns, nor what all but one function happily accept?
>
> Anyone familiar enough with ctypes that can shed some light?

I'm not a ctypes expert, but you might get better responses if you show
the code you're using, the expected result, and the result you actually
get.

-- 
Steven
Re: Whittle it on down
On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:

> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>>
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>> > I don't even care about faster: Its overly complicated. Sometimes a
>> > regular expression really is the clearest way to solve a problem.
>>
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
>
> I don't know, but mostly because I wouldn't even try.

Really? Peter Otten seems to have found a solution, and Random832 almost
found it too.

> The requirements
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.

I'm not talking about the OP's data. I'm talking about *my* requirements.

I thought that this was a friendly discussion about regexes, but perhaps
I was mistaken. Because I sure am feeling a lot of hostility to the
ideas that regexes are not necessarily the only way to solve this, and
that data validation is a good thing.

> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.

Right. Which makes it *more*, not less, important to be sure that your
regex doesn't match too much, because your data is likely to be
contaminated by junk strings that don't belong in the data and shouldn't
be accepted. I've done enough web scraping to realise just how easy it
is to start grabbing data from the wrong part of the file.

> Instead, he wants a heuristic to pull out what look like section titles.

Good for him. I asked a different question. Does my question not count?

> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
>
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)

That simple rule doesn't match his examples, as I know too well because
I made the silly mistake of writing to the spec as written without
reading the examples as well. As I already admitted. That was a silly
mistake because I know very well that people are really bad at writing
detailed specs that neither match too much nor too little.

But you know, I was more focused on the rest of his question, namely
whether it was better to extract the matched strings into a new list, or
delete the non-matches from the existing list, and just got carried away
writing the match function. I didn't actually expect anyone to use it.
It was untested, and I hinted that a regex would probably be better.

I was trying to teach DFS a generic programming technique, not solve his
stupid web scraping problem for him. What happens next time when he's
trying to filter a list of floats, or Widgets? Should he convert them to
strings so he can use a regex to match them, or should he learn about
general filtering techniques?

> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> looking at the data, I suspect its good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
>
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. Its a practical problem with a simple, practical
> answer.

Yes, and that practical answer needs to reject:

- the empty string, because it is easy to mistakenly get empty strings
  when scraping data, especially if you post-process the data;

- strings that are all spaces, because " " cannot possibly be a title;

- strings that are all ampersands, because "&" is not a title, and it
  almost surely indicates that your scraping has gone wrong and you're
  reading junk from somewhere;

- even leading and trailing spaces are suspect: " FOO " doesn't match
  any of the examples given, and it seems unlikely to be a title.
  Presumably the strings have already been filtered or post-processed to
  have leading and trailing spaces removed, in which case " FOO "
  reveals a bug.

-- 
Steven
Re: Whittle it on down
On Thu, 5 May 2016 10:31 pm, DFS wrote:

> You are out of your mind.

That's twice you've tried to put me down, first by dismissing my
comments about text processing with "Linguist much", and now an outright
insult. The first time I laughed it off and made a joke about it. I
won't do that again.

You asked whether it was better to extract the matching strings into a
new list, or remove them in place in the existing list. I not only
showed you how to do both, but I tried to give you the mental tools to
understand when you should pick one answer over the other. And your
response is to insult me and question my sanity.

Well, DFS, I might be crazy, but I'm not stupid. If that's really how
you feel about my answers, I won't make the mistake of wasting my time
answering your questions in the future.

Over to you now.

-- 
Steven
Re: Whittle it on down
Steven D'Aprano writes:
> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
>
> but it fails on strings like "AA & A & A". What am I doing wrong?
It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
when the middle part is just one LETTER. That's something of a
misanalysis anyway. I notice that the correct pattern has already been
posted at least thrice and you have acknowledged one of them.
But I think you are also trying to do too much with a single regex. A
more promising start is to think of the whole string as "parts" joined
with "glue", then split with a glue pattern and test the parts:
    import re

    glue = re.compile(" *& *| +")
    keep, drop = [], []
    for datum in data:
        items = glue.split(datum)
        if all(map(str.isupper, items)):
            keep.append(datum)
        else:
            drop.append(datum)
That will cope with Greek, by the way.
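Indeed, str.isupper is Unicode-aware, so the same loop accepts Greek titles; a quick check (the example strings are my own):

```python
import re

glue = re.compile(" *& *| +")
for datum in ("ΔΣΘΛ", "ΔΣ & ΘΛ"):
    items = glue.split(datum)
    # Every part between the glue is fully uppercase, Greek included.
    assert all(map(str.isupper, items))

# A lowercase Greek letter fails, as expected.
assert not "ΔσΘΛ".isupper()
```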
It's annoying that the order of the branches of the glue pattern above
matters. One _does_ have problems when one uses the usual regex engines.
Capturing groups in the glue pattern would produce glue items in the
split output. Either avoid them or deal with them: one could split with
the underspecific "([ &]+)" and then check that each glue item contains
at most one ampersand. One could also allow other punctuation, and then
check afterwards.
One can use _another_ regex to test individual parts. Code above used
str.isupper to test a part. The improved regex package (from PyPI, to
cope with Greek) can do the same:
    import regex

    part = regex.compile("[[:upper:]]+")
    glue = regex.compile(" *& *| +")
    keep, drop = [], []
    for datum in data:
        items = glue.split(datum)
        if all(map(part.fullmatch, items)):
            keep.append(datum)
        else:
            drop.append(datum)
Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with most
of Finnish; the [:upper:] class is nicer and there's much more that is
nicer in the newer regex package.
The point of using a regex for this is that the part pattern can then be
generalized to allow some punctuation or digits in a part, for example.
Anything that the glue pattern doesn't consume. (Nothing wrong with
using other techniques for this, either; str.isupper worked nicely
above.)
It's also possible to swap the roles of the patterns. Split with a part
pattern. Then check that the text between such parts is glue:
    keep, drop = [], []
    for datum in data:
        items = part.split(datum)
        if all(map(glue.fullmatch, items)):
            keep.append(datum)
        else:
            drop.append(datum)
The point is to keep the patterns simple by making them more local, or
more relaxed, followed by a further test. This way they can be made to
do more, but not more than they reasonably can.
Note also the use of re.fullmatch instead of re.match (let alone
re.search) when a full match is required! This gets rid of all anchors
in the pattern, which may in turn allow fewer parentheses inside the
pattern.
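The difference is easy to demonstrate (a small sketch of my own):

```python
import re

pat = "[A-Z]+"

# re.match anchors only at the start of the string...
assert re.match(pat, "ABCdef")
# ...while re.fullmatch requires the whole string to match,
# with no ^...$ anchors needed in the pattern.
assert re.fullmatch(pat, "ABCdef") is None
assert re.fullmatch(pat, "ABC")
```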
The usual regex engines are not perfect, but parts of them are
fantastic.
Re: Whittle it on down
On Thu, 5 May 2016 11:21 pm, Random832 wrote:
> On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
>
> Well, obviously *your* language (not the OP's), given the cases you
> reject, is "one or more sequences of letters separated by
> space*-ampersand-space*", and that is actually one of the easiest kinds
> of regex to write: "[A-Z]+( *& *[A-Z]+)*".
One of the easiest kind of regex to write incorrectly:
py> re.match("[A-Z]+( *& *[A-Z]+)*", "A")
<_sre.SRE_Match object at 0xb7bf4aa0>
It doesn't even get the "all uppercase" part of the specification:
py> re.match("[A-Z]+( *& *[A-Z]+)*", "Azzz")
<_sre.SRE_Match object at 0xb7bf4aa0>
You failed to anchor the pattern at the beginning and end of the string, an
easy mistake to make, but that's the point. It's easy to make mistakes with
regexes because the syntax is so overly terse and unforgiving.
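To make the anchoring point concrete (my own sketch, not code from the thread):

```python
import re

unanchored = r"[A-Z]+( *& *[A-Z]+)*"
anchored = r"^(?:[A-Z]+( *& *[A-Z]+)*)$"

# The unanchored pattern happily matches a valid prefix of junk input:
loose_hit = re.match(unanchored, "Azzz")    # matches just "A"

# The anchored version rejects the same string outright, while still
# accepting the tricky "AA & A & A" case.
strict_miss = re.match(anchored, "Azzz")
strict_hit = re.match(anchored, "AA & A & A")
```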
But I think I just learned something important today. I learned that it's
not actually regexes that I dislike, it's regex culture that I dislike.
What I learned from this thread:
- Nobody could possibly want to support non-ASCII text. (Apart from the
approximately 6.5 billion people in the world that don't speak English of
course, an utterly insignificant majority.)
- Data validity doesn't matter, because there's no possible way that you
might accidentally scrape data from the wrong part of a HTML file and end
up with junk input.
- Even if you do somehow end up with junk, there couldn't possibly be any
real consequences to that.
- It doesn't matter if you match too much, or too little, that just means the
specs are too pedantic.
Hence the famous quote:
Some people, when confronted with a problem, think
"I know, I'll use regular expressions." Now they
have two problems.
It's not really regexes that are the problem.
> However, your spec is wrong:
How can you say that? It's *my* spec, I can specify anything I want.
>> - Leading or trailing spaces, or spaces not surrounding an ampersand,
>> must not match: "AAA BBB" must be rejected.
>
> The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
> CONSULTANTS & TRAINERS'.
That's very nice, but irrelevant. I'm not talking about the OP's outputs.
I'm giving my own.
--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:
> Steven D'Aprano writes:
>
>> I get something like this:
>>
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>
>>
>> but it fails on strings like "AA & A & A". What am I doing wrong?
>
> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
> when the middle part is just one LETTER. That's something of a
> misanalysis anyway. I notice that the correct pattern has already been
> posted at least thrice and you have acknowledged one of them.
Thrice? I've seen Peter's response (he made the trivial and obvious
simplification of just using A instead of [A-Z], but that was easy to
understand), and Random832 almost got it, missing only that you need to
match the entire string, not just a substring. If there was a third
response, I missed it.
> But I think you are also trying to do too much with a single regex. A
> more promising start is to think of the whole string as "parts" joined
> with "glue", then split with a glue pattern and test the parts:
>
> import re
> glue = re.compile(" *& *| +")
> keep, drop = [], []
> for datum in data:
>     items = glue.split(datum)
>     if all(map(str.isupper, items)):
>         keep.append(datum)
>     else:
>         drop.append(datum)
Ah, the penny drops! For a while I thought you were suggesting using this to
assemble a regex, and it just wasn't making sense to me. Then I realised
you were using this as a matcher: feed in the list of strings, and it
splits it into strings to keep and strings to discard. Nicely done, that is
a good technique to remember.
Thanks for the analysis!
--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
Steven D'Aprano writes:

> On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:
>
>> Steven D'Aprano writes:
>>
>>> I get something like this:
>>>
>>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>>
>>> but it fails on strings like "AA & A & A". What am I doing wrong?
>>
>> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
>> when the middle part is just one LETTER. That's something of a
>> misanalysis anyway. I notice that the correct pattern has already been
>> posted at least thrice and you have acknowledged one of them.
>
> Thrice? I've seen Peter's response (he made the trivial and obvious
> simplification of just using A instead of [A-Z], but that was easy to
> understand), and Random832 almost got it, missing only that you need to
> match the entire string, not just a substring. If there was a third
> response, I missed it.

I think I saw another. I may be mistaken.

Random832's pattern is fine. You need to use re.fullmatch with it.
--
https://mail.python.org/mailman/listinfo/python-list
PyDev 5.0.0 Released
PyDev 5.0.0 Released

Release Highlights:
-------------------
* **Important** PyDev now requires Java 8.
    * PyDev 4.5.5 is the last release supporting Java 7.
    * See: http://www.pydev.org/update_sites/index.html for the update site
      of older versions of PyDev.
    * See: the **PyDev does not appear after install** section on
      http://www.pydev.org/download.html for help on using a Java 8 vm in
      Eclipse.
* PyUnit view now persists its state across restarts.
* Fixed issue in super() code completion.
* PyDev.Debugger updated to the latest version.
* No longer showing un-needed shell on Linux on startup when showing
  donation dialog.
* Fixed pyedit_wrap_expression to avoid halt of the IDE on Ctrl+1 ->
  Wrap expression.

What is PyDev?
--------------
PyDev is an open-source Python IDE on top of Eclipse for Python, Jython
and IronPython development. It comes with goodies such as code completion,
syntax highlighting, syntax analysis, code analysis, refactor, debug,
interactive console, etc.

Details on PyDev: http://pydev.org
Details on its development: http://pydev.blogspot.com

What is LiClipse?
-----------------
LiClipse is a PyDev standalone with goodies such as support for multiple
cursors, theming, TextMate bundles and a number of other languages such as
Django Templates, Jinja2, Kivy Language, Mako Templates, Html, Javascript,
etc. It's also a commercial counterpart which helps supporting the
development of PyDev.

Details on LiClipse: http://www.liclipse.com/

Cheers,
--
Fabio Zadrozny
Software Developer
LiClipse
http://www.liclipse.com
PyDev - Python Development Environment for Eclipse
http://pydev.org
http://pydev.blogspot.com
PyVmMonitor - Python Profiler
http://www.pyvmmonitor.com/
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, May 5, 2016, at 14:03, Steven D'Aprano wrote:
> You failed to anchor the string at the beginning and end of the string,
> an easy mistake to make, but that's the point.

I don't think anchoring is properly a concern of the regex itself -
.match is anchored implicitly at the beginning, and one could easily
imagine an API that implicitly anchors at the end - or you can simply
check that the match length == the string length.

> - Data validity doesn't matter, because there's no possible way that you
> might accidentally scrape data from the wrong part of a HTML file and end
> up with junk input.

If you've scraped data from the wrong part of the file, then nothing you
do to your regex can prevent the junk input from coincidentally matching
the input format.
--
https://mail.python.org/mailman/listinfo/python-list
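The length check mentioned above can be sketched as follows (a hypothetical helper of my own, not code from the thread):

```python
import re

def full_match(pattern, string):
    # re.match is implicitly anchored at the start; checking that the
    # match consumed the whole string emulates anchoring at the end.
    m = re.match(pattern, string)
    return m is not None and m.end() == len(string)
```

With the pattern from upthread, full_match accepts "AA & BB" but rejects "Azzz", whose greedy match stops after the leading "A".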
Re: Whittle it on down
On Thu, May 5, 2016, at 14:27, Jussi Piitulainen wrote:
> Random832's pattern is fine. You need to use re.fullmatch with it.

Heh, in my previous post I said "and one could easily imagine an API
that implicitly anchors at the end". So easy to imagine that someone
already did, as it turns out. Batteries included indeed.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, May 5, 2016, at 10:43 AM, Steven D'Aprano wrote:
> On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:
>
> > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> >> Oh, a further thought...
> >>
> >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> >> > I don't even care about faster: Its overly complicated. Sometimes a
> >> > regular expression really is the clearest way to solve a problem.
> >>
> >> Putting non-ASCII letters aside for the moment, how would you match
> >> these specs as a regular expression?
> >
> > I don't know, but mostly because I wouldn't even try.
>
> Really? Peter Otten seems to have found a solution, and Random832 almost
> found it too.
>
> > The requirements are over-specified. If you look at the OP's data (and
> > based on previous conversation), he's doing web scraping and trying to
> > pull out good data.
>
> I'm not talking about the OP's data. I'm talking about *my* requirements.
>
> I thought that this was a friendly discussion about regexes, but perhaps
> I was mistaken. Because I sure am feeling a lot of hostility to the
> ideas that regexes are not necessarily the only way to solve this, and
> that data validation is a good thing.

Umm, what? Hostility? I have no idea where you're getting that.

I didn't say that regexes are the only way to solve problems; in fact
they're something I avoid using in most cases. In the OP's case, though,
I did say I thought it was a natural fit. Usually, I'd go for
startswith/endswith, "in", slicing and such string primitives before I
go for a regular expression. "Find all upper cased phrases that may have
&'s in them" is something just specific enough that the built-in string
primitives are awkward tools.

In my experience, most of the problems with regexes come from people
thinking they're the hammer and every problem is a nail: they get into
ever more convoluted expressions that become brittle. More specific in a
regular expression is not, necessarily, a virtue. In fact it's exactly
the opposite a lot of times.

> > There's no absolutely perfect way to do that because the system he's
> > scraping isn't meant for data processing. The data isn't cleanly
> > articulated.
>
> Right. Which makes it *more*, not less, important to be sure that your
> regex doesn't match too much, because your data is likely to be
> contaminated by junk strings that don't belong in the data and shouldn't
> be accepted. I've done enough web scraping to realise just how easy it
> is to start grabbing data from the wrong part of the file.

I have nothing against data validation: I don't think it belongs in
regular expressions, though. That can be a step done afterwards.

> > Instead, he wants a heuristic to pull out what look like section
> > titles.
>
> Good for him. I asked a different question. Does my question not count?

Sure it counts, but I don't want to engage in your theoretical exercise.
That's not being hostile, that's me not wanting to think about a complex
set of constraints for a regular expression for purely intellectual
reasons.

> I was trying to teach DFS a generic programming technique, not solve his
> stupid web scraping problem for him. What happens next time when he's
> trying to filter a list of floats, or Widgets? Should he convert them to
> strings so he can use a regex to match them, or should he learn about
> general filtering techniques?

Come on. This is a bit presumptuous, don't you think?

> > This translates naturally into a simple regular expression: an
> > uppercase string with spaces and &'s. Now, that expression doesn't
> > 100% encode every detail of that rule-- it allows both Q&A and Q & A--
> > but on my own looking at the data, I suspect its good enough. The
> > titles are clearly separate from the other data scraped by their being
> > upper cased. We just need to expand our allowed character range into
> > spaces and &'s.
> >
> > Nothing in the OP's request demands the kind of rigorous matching that
> > your scenario does. Its a practical problem with a simple, practical
> > answer.
>
> Yes, and that practical answer needs to reject:
>
> - the empty string, because it is easy to mistakenly get empty strings
> when scraping data, especially if you post-process the data;
>
> - strings that are all spaces, because "    " cannot possibly be a
> title;
>
> - strings that are all ampersands, because "&" is not a title, and it
> almost surely indicates that your scraping has gone wrong and you're
> reading junk from somewhere;
>
> - even leading and trailing spaces are suspect: " FOO " doesn't match
> any of the examples given, and it seems unlikely to be a title.
> Presumably the strings have already been filtered or post-processed to
> have leading and trailing spaces removed, in which case " FOO " reveals
> a bug.

We're going to have to agree to disagree. I find all of that
unnecessary. Any validation can be easily done before or after matching;
you don't need to over-complicate.
Re: Whittle it on down
On Thu, May 5, 2016, at 05:31 AM, DFS wrote:
> You are out of your mind.

Whoa, now. I might disagree with Steven D'Aprano about how to approach
this problem, but there's no need to be rude. Everyone's trying to help
you, after all.
--
Stephen Hansen
m e @ i x o k a i . i o
--
https://mail.python.org/mailman/listinfo/python-list
Python is an Equal Opportunity Programming Language
https://motherboard.vice.com/blog/python-is-an-equal-opportunity-programming-language

from an 'Intel® Software Evangelist'
--
Terry Jan Reedy
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, May 5, 2016, at 11:03 AM, Steven D'Aprano wrote:
> - Nobody could possibly want to support non-ASCII text. (Apart from the
> approximately 6.5 billion people in the world that don't speak English of
> course, an utterly insignificant majority.)
Oh, I'd absolutely want to support non-ASCII text. If I have unicode
input, though, I unfortunately have to rely on
https://pypi.python.org/pypi/regex as 're' doesn't support matching on
character properties.
I keep hoping it'll replace "re", then we could do:
pattern = regex.compile(r"^[\p{Lu}\s&]+$")
where \p{property} matches against character properties in the unicode
database.
> - Data validity doesn't matter, because there's no possible way that you
> might accidentally scrape data from the wrong part of a HTML file and end
> up with junk input.
Um, no one said that. I was arguing that the *regular expression*
doesn't need to be responsible for validation.
> - Even if you do somehow end up with junk, there couldn't possibly be any
> real consequences to that.
No one said that either...
> - It doesn't matter if you match too much, or to little, that just means
> the
> specs are too pedantic.
Or that...
--
Stephen Hansen
m e @ i x o k a i . i o
--
https://mail.python.org/mailman/listinfo/python-list
Re: Comparing Python enums to Java, was: How much sanity checking is required for function inputs?
On 04/24/2016 08:20 AM, Ian Kelly wrote:
> On Sun, Apr 24, 2016 at 1:20 AM, Ethan Furman wrote:
>
>> What fun things can Java enums do?
>
> Everything that Python enums can do, plus:
>
> --> Planet.EARTH.value
> (5.976e+24, 6378140.0)
> --> Planet.EARTH.surface_gravity
> 9.802652743337129
>
> This is incredibly useful, but it has a flaw: the value of each member
> of the enum is just the tuple of its arguments. Suppose we added a
> value for COUNTER_EARTH describing a hypothetical planet with the same
> mass and radius existing on the other side of the sun. [1] Then:
>
> --> Planet.EARTH is Planet.COUNTER_EARTH
> True

If using Python 3 and aenum 1.4.1+, you can do

--> class Planet(Enum, settings=NoAlias, init='mass radius'):
...     MERCURY = (3.303e+23, 2.4397e6)
...     VENUS   = (4.869e+24, 6.0518e6)
...     EARTH   = (5.976e+24, 6.37814e6)
...     COUNTER_EARTH = EARTH
...     @property
...     def surface_gravity(self):
...         # universal gravitational constant  (m3 kg-1 s-2)
...         G = 6.67300E-11
...         return G * self.mass / (self.radius * self.radius)
...
--> Planet.EARTH.value
(5.976e+24, 6378140.0)
--> Planet.EARTH.surface_gravity
9.802652743337129
--> Planet.COUNTER_EARTH.value
(5.976e+24, 6378140.0)
--> Planet.COUNTER_EARTH.surface_gravity
9.802652743337129
--> Planet.EARTH is Planet.COUNTER_EARTH
False

> * Speaking of AutoNumber, since Java enums don't have the
> instance/value distinction, they effectively do this implicitly, only
> without generating a bunch of ints that are entirely irrelevant to your
> enum type. With Python enums you have to follow a somewhat arcane
> recipe to avoid specifying values, which just generates some values and
> then hides them away. And it also breaks the Enum alias feature:
>
> --> class Color(AutoNumber):
> ...     red = default = ()  # not an alias!
> ...     blue = ()
> ...

Another thing you could do here:

--> class Color(Enum, settings=AutoNumber):
...     red
...     default = red
...     blue
...
--> list(Color)
[<Color.red: 1>, <Color.blue: 2>]
--> Color.default is Color.red
True

--
~Ethan~
--
https://mail.python.org/mailman/listinfo/python-list
Re: Ctypes c_void_p overflow
On Thu, May 5, 2016 at 10:42 AM, Joseph L. Casale wrote:
> I have CDLL function I use to get a pointer, several other functions
> happily accept this pointer which is really a long when passed to
> ctypes.c_void_p. However, only one with same type def in the prototype
> overflows. Docs suggest c_void_p takes an int but that is not what the
> first call returns, nor what all but one function happily accept?

What you're describing isn't clear to me, so I'll describe the general
case of handling pointers with ctypes functions.

If a function returns a pointer, you must set the function's restype to
a pointer type since the default c_int restype truncates the upper half
of a 64-bit pointer. Generally you also have to do the same for pointer
parameters in argtypes. Otherwise integer arguments are converted to C
int values.

Note that when restype is set to c_void_p, the result gets converted to
a Python integer (or None for a NULL result). If you pass this result
back as a ctypes function argument, the function must have the parameter
set to c_void_p in argtypes. If argtypes isn't set, the default integer
conversion may truncate the pointer value. This problem won't occur on a
32-bit platform, so there's a lot of carelessly written ctypes code that
makes this mistake.

Simple types are also automatically converted when accessed as a field
of a struct or union or as an index of an array or pointer. To avoid
this, you can use a subclass of the type, since ctypes won't
automatically convert subclasses of simple types.

I generally avoid c_void_p because its lenient from_param method (called
to convert arguments) doesn't provide much type safety. If a bug causes
an incorrect argument to be passed, I prefer getting an immediate
ctypes.ArgumentError rather than a segfault or data corruption.

For example, when a C API returns a void pointer as a handle for an
opaque structure or object, I prefer to handle it as a pointer to an
empty Structure subclass, as follows:

    class _ContosoHandle(ctypes.Structure):
        pass

    ContosoHandle = ctypes.POINTER(_ContosoHandle)

    lib.CreateContoso.restype = ContosoHandle
    lib.DestroyContoso.argtypes = (ContosoHandle,)

ctypes will raise an ArgumentError if DestroyContoso is called with
arguments such as 123456789 or "Crash Me".
--
https://mail.python.org/mailman/listinfo/python-list
python, ctypes and GetIconInfo issue
Hello,
I try to make the GetIconInfo function work, but I can't figure out
what I'm doing wrong.
From the MSDN documentation the function is
https://msdn.microsoft.com/en-us/library/windows/desktop/ms648070%28v=vs.85%29.aspx
# BOOL WINAPI GetIconInfo(
#   _In_  HICON     hIcon,
#   _Out_ PICONINFO piconinfo
# );
which I defined as
GetIconInfo = windll.user32.GetIconInfo
GetIconInfo.argtypes = [HICON, POINTER(ICONINFO)]
GetIconInfo.restype= BOOL
GetIconInfo.errcheck = ErrorIfZero
The structure piconinfo is described as
https://msdn.microsoft.com/en-us/library/windows/desktop/ms648052%28v=vs.85%29.aspx
# typedef struct _ICONINFO {
#   BOOL    fIcon;
#   DWORD   xHotspot;
#   DWORD   yHotspot;
#   HBITMAP hbmMask;
#   HBITMAP hbmColor;
# } ICONINFO, *PICONINFO;
my implementation is
class ICONINFO(Structure):
    __fields__ = [
        ('fIcon', BOOL),
        ('xHotspot', DWORD),
        ('yHotspot', DWORD),
        ('hbmMask', HBITMAP),
        ('hbmColor', HBITMAP),
    ]
Not part of the problem, but needed to get the icon handle:
hicon = ImageList_GetIcon(def_il_handle,1,ILD_NORMAL)
print hicon
As the documentation states, the function succeeded if the return code is
non-zero. Well, I get 1 returned, but as soon as I try to access a class
member the program crashes.
iconinfo = ICONINFO()
lres = GetIconInfo(hicon, pointer(iconinfo))
print lres
print '{0}'.format(sizeof(iconinfo)) # <- crash
If I comment out the print of sizeof... the program keeps running, but if I
call the same code a second time then it crashes at GetIconInfo(hicon, ...).
So it looks like I'm doing something terribly wrong but don't see it.
Can someone shed some light on it?
Thank you
Hubert
--
https://mail.python.org/mailman/listinfo/python-list
RE: Ctypes c_void_p overflow
> I generally avoid c_void_p because its lenient from_param method
> (called to convert arguments) doesn't provide much type safety. If a
> bug causes an incorrect argument to be passed, I prefer getting an
> immediate ctypes.ArgumentError rather than a segfault or data
> corruption. For example, when a C API returns a void pointer as a
> handle for an opaque structure or object, I prefer to handle it as a
> pointer to an empty Structure subclass, as follows:
>
> class _ContosoHandle(ctypes.Structure):
>     pass
>
> ContosoHandle = ctypes.POINTER(_ContosoHandle)
>
> lib.CreateContoso.restype = ContosoHandle
> lib.DestroyContoso.argtypes = (ContosoHandle,)
>
> ctypes will raise an ArgumentError if DestroyContoso is called with
> arguments such as 123456789 or "Crash Me".

After typing up a response with all the detail, your reply helped me see
the error. Thank you so much for all that detail, it was very much
appreciated!

jlc
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 1:54 PM, Steven D'Aprano wrote:
> On Thu, 5 May 2016 10:31 pm, DFS wrote:
>
>> You are out of your mind.
>
> That's twice you've tried to put me down, first by dismissing my
> comments about text processing with "Linguist much", and now an
> outright insult. The first time I laughed it off and made a joke about
> it. I won't do that again.
>
> You asked whether it was better to extract the matching strings into a
> new list, or remove them in place in the existing list. I not only
> showed you how to do both, but I tried to give you the mental tools to
> understand when you should pick one answer over the other. And your
> response is to insult me and question my sanity.
>
> Well, DFS, I might be crazy, but I'm not stupid. If that's really how
> you feel about my answers, I won't make the mistake of wasting my time
> answering your questions in the future.
>
> Over to you now.
heh! Relax, pal.
I was just trying to be funny - no insult intended either time, of
course. Look for similar responses from me in the future. Usenet
brings out the smart-aleck in me.
Actually, you should've accepted the 'Linguist much?' as a compliment,
because I seriously thought you were.
But you ARE out of your mind if you prefer that convoluted "function"
method over a simple 1-line regex method (as per S. Hansen).
def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    parts[0] = parts[0].rstrip(" ")
    parts[-1] = parts[-1].lstrip(" ")
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)
I'm sure it does the job well, but that style brings back [bad] memories
of the VBA I used to write. I expected something very concise and
'pythonic' (which I'm learning is everyone's favorite mantra here in
python-land).
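For reference, the one-line regex method mentioned above (Stephen Hansen's pattern from earlier in the thread, applied to a few of the thread's example strings) comes down to:

```python
import re

# The one-line check discussed upthread (ASCII-only, by design of the
# original example data).
pattern = re.compile(r"^[A-Z\s&]+$")

data = ["HEALTH CLUBS & GYMNASIUMS", "Azzz", "FITNESS CENTERS",
        "edit address"]
matches = [s for s in data if pattern.match(s)]
# keeps only the two all-uppercase entries
```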
Anyway, I appreciate ALL replies to my queries. So thank you for taking
the time.
Whenever I'm able, I'll try to contribute to clp as well.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 2:56 PM, Stephen Hansen wrote:
> On Thu, May 5, 2016, at 05:31 AM, DFS wrote:
>> You are out of your mind.
>
> Whoa, now. I might disagree with Steven D'Aprano about how to approach
> this problem, but there's no need to be rude.

Seriously not trying to be rude - more smart-alecky than anything. Hope
D'Aprano doesn't stay butthurt...

> Everyone's trying to help you, after all.

Yes, and I do appreciate it. I've only been working with python for
about a month, but I feel like I'm making good progress. clp is a great
resource, and I'll be hanging around for a long time, and will
contribute when possible.

Thanks for your help.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> Given:
>
> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)',
> 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name',
> 'Atlanta city guide', 'edit address', 'Tweet',
> 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/',
> 'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 'GYMNASIUMS',
> 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS',
> 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> 'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages',
> 'About Us', 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy',
> 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login',
> 'F.A.Q.']
>
> Then:
>
> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in list if pattern.match(x)]
> output
> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS',
> 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
> 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS TRAINERS',
> 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS',
> 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Should've looked earlier. Their master list of categories

http://www.usdirectory.com/cat/g0

shows a few commas, a bunch of dashes, and the ampersands we talked
about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
comma. "AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")

Thanks again.
--
https://mail.python.org/mailman/listinfo/python-list
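A quick check (my own, not from the thread) of the updated character class against the category names mentioned above:

```python
import re

updated = re.compile(r"^[A-Z\s&,-]+$")

# The trailing "-" in the class is a literal dash, so both the comma and
# the dash cases now pass, while lowercase URLs are still rejected.
keeps_comma = updated.match("OFFICE SERVICES, SUPPLIES & EQUIPMENT")
keeps_dash = updated.match("AUTOMOBILE - DEALERS")
still_drops = updated.match("www.lafitness.com")
```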
Re: python, ctypes and GetIconInfo issue
On Thu, May 5, 2016 at 3:47 PM, wrote:
>
> I try to make the GetIconInfo function work, but I can't figure out
> what I'm doing wrong.
>
> From the MSDN documentation the function is
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms648070%28v=vs.85%29.aspx
>
> # BOOL WINAPI GetIconInfo(
> #   _In_  HICON     hIcon,
> #   _Out_ PICONINFO piconinfo
> # );
>
> which I defined as
>
> GetIconInfo = windll.user32.GetIconInfo
> GetIconInfo.argtypes = [HICON, POINTER(ICONINFO)]
> GetIconInfo.restype= BOOL
> GetIconInfo.errcheck = ErrorIfZero
Please avoid windll. It caches the loaded library, which in turn
caches function pointers. So all packages that use windll.user32 are
potentially stepping on each others' toes with mutually incompatible
function prototypes. It also doesn't allow configuring
use_last_error=True to enable ctypes.get_last_error() for WinAPI
function calls.
> The structure piconinfo is described as
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms648052%28v=vs.85%29.aspx
>
> # typedef struct _ICONINFO {
> #   BOOL    fIcon;
> #   DWORD   xHotspot;
> #   DWORD   yHotspot;
> #   HBITMAP hbmMask;
> #   HBITMAP hbmColor;
> # } ICONINFO, *PICONINFO;
>
> my implementation is
>
> class ICONINFO(Structure):
>     __fields__ = [
>         ('fIcon', BOOL),
>         ('xHotspot', DWORD),
>         ('yHotspot', DWORD),
>         ('hbmMask', HBITMAP),
>         ('hbmColor', HBITMAP),
>     ]
The attribute name is "_fields_", not "__fields__", so you haven't
actually defined any fields and sizeof(ICONINFO) is 0. When you pass
this empty struct to GetIconInfo, it potentially overwrites and
corrupts existing data on the heap that can lead to a crash later on.
Here's the setup I created to test GetIconInfo and GetIconInfoEx.
Maybe you can reuse some of this code, but if you're using XP this
won't work as written because GetIconInfoEx was added in Vista.
Note the use of a __del__ finalizer to call DeleteObject on the
bitmaps. Otherwise, in a real application, calling GetIconInfo would
leak memory. Using __del__ is convenient, but note that you can't
reuse an instance without manually calling DeleteObject on the
bitmaps.
import ctypes
from ctypes import wintypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
user32 = ctypes.WinDLL('user32', use_last_error=True)
gdi32 = ctypes.WinDLL('gdi32')
MAX_PATH = 260
IMAGE_ICON = 1
class ICONINFO_BASE(ctypes.Structure):
    def __del__(self, gdi32=gdi32):
        if self.hbmMask:
            gdi32.DeleteObject(self.hbmMask)
            self.hbmMask = None
        if self.hbmColor:
            gdi32.DeleteObject(self.hbmColor)
            self.hbmColor = None
class ICONINFO(ICONINFO_BASE):
    _fields_ = (('fIcon',    wintypes.BOOL),
                ('xHotspot', wintypes.DWORD),
                ('yHotspot', wintypes.DWORD),
                ('hbmMask',  wintypes.HBITMAP),
                ('hbmColor', wintypes.HBITMAP))
class ICONINFOEX(ICONINFO_BASE):
    _fields_ = (('cbSize',    wintypes.DWORD),
                ('fIcon',     wintypes.BOOL),
                ('xHotspot',  wintypes.DWORD),
                ('yHotspot',  wintypes.DWORD),
                ('hbmMask',   wintypes.HBITMAP),
                ('hbmColor',  wintypes.HBITMAP),
                ('wResID',    wintypes.WORD),
                ('szModName', wintypes.WCHAR * MAX_PATH),
                ('szResName', wintypes.WCHAR * MAX_PATH))

    def __init__(self, *args, **kwds):
        super(ICONINFOEX, self).__init__(*args, **kwds)
        self.cbSize = ctypes.sizeof(self)
PICONINFO = ctypes.POINTER(ICONINFO)
PICONINFOEX = ctypes.POINTER(ICONINFOEX)

def check_bool(result, func, args):
    if not result:
        raise ctypes.WinError(ctypes.get_last_error())
    return args

kernel32.GetModuleHandleW.errcheck = check_bool
kernel32.GetModuleHandleW.restype = wintypes.HMODULE
kernel32.GetModuleHandleW.argtypes = (
    wintypes.LPCWSTR,)  # _In_opt_ lpModuleName

# DeleteObject doesn't call SetLastError
gdi32.DeleteObject.restype = wintypes.BOOL
gdi32.DeleteObject.argtypes = (
    wintypes.HGDIOBJ,)  # _In_ hObject

user32.LoadImageW.errcheck = check_bool
user32.LoadImageW.restype = wintypes.HANDLE
user32.LoadImageW.argtypes = (
    wintypes.HINSTANCE,  # _In_opt_ hinst
    wintypes.LPCWSTR,    # _In_     lpszName
    wintypes.UINT,       # _In_     uType
    ctypes.c_int,        # _In_     cxDesired
    ctypes.c_int,        # _In_     cyDesired
    wintypes.UINT,)      # _In_     fuLoad

user32.DestroyIcon.errcheck = check_bool
user32.DestroyIcon.restype = wintypes.BOOL
user32.DestroyIcon.argtypes = (
    wintypes.HICON,)  # _In_ hIcon
Re: Whittle it on down
On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote:
> Random832's pattern is fine. You need to use re.fullmatch with it.

py> re.fullmatch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'fullmatch'

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
Pylint prefers list comprehension over filter...
Greetings,

Below is the code that I mentioned in an earlier thread.

string = "Whiskey Tango Foxtrot"
''.join(list(filter(str.isupper, string)))

'WTF'

That works fine and dandy. Except Pylint doesn't like it. According to
this link, list comprehensions have replaced filters and the Pylint
warning can be disabled.

http://stackoverflow.com/questions/3569134/why-doesnt-pylint-like-built-in-functions

Here's the replacement code using list comprehension:

''.join([x for x in string if x.isupper()])

Which one is correct (Pythonic)? Or does it matter?

Thank you,

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Pylint prefers list comprehension over filter...
On Fri, May 6, 2016 at 11:26 AM, Christopher Reimer wrote:
> Below is the code that I mentioned in an earlier thread.
>
> string = "Whiskey Tango Foxtrot"
> ''.join(list(filter(str.isupper, string)))
>
> 'WTF'
>
> That works fine and dandy. Except Pylint doesn't like it. According to
> this link, list comprehensions have replaced filters and the Pylint
> warning can be disabled.
>
> http://stackoverflow.com/questions/3569134/why-doesnt-pylint-like-built-in-functions
>
> Here's the replacement code using list comprehension:
>
> ''.join([x for x in string if x.isupper()])
>
> Which one is correct (Pythonic)? Or does it matter?

Nothing wrong with filter. Since join() is going to iterate over its
argument anyway, you don't need the list() call, you can remove that,
but you don't have to go for comprehensions:

''.join(filter(str.isupper, string))

Rule of thumb: If the function already exists, use filter or map. If you
would be using filter/map with a lambda function, reach for a
comprehension instead. In this case, str.isupper exists, so use it!

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Pylint prefers list comprehension over filter...
On Thu, May 5, 2016, at 06:26 PM, Christopher Reimer wrote:
> Which one is correct (Pythonic)? Or does it matter?

First, pylint is somewhat opinionated, and its default options shouldn't be
taken as gospel. There's no correct: filter is fine.

That said, the general consensus is, I believe, that list comprehensions
are good, and using them is great.

In your case, though, I would not use a list comprehension. I'd use a
generator expression. It looks almost identical:

''.join(x for x in string if x.isupper())

The difference is, both filter and your list comprehension *build a list*,
which is not needed and wasteful. The above skips building a list, instead
returning a generator, and join pulls items out of it one at a time as it
uses them. No needlessly creating a list only to use it and discard it.

--
Stephen Hansen
m e @ i x o k a i . i o
--
https://mail.python.org/mailman/listinfo/python-list
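A quick sketch of the difference described above (illustrative only, not
from the original message):

```python
string = "Whiskey Tango Foxtrot"

# List comprehension: materializes a full intermediate list, which
# join() then consumes and discards.
via_list = ''.join([x for x in string if x.isupper()])

# Generator expression: join() pulls one character at a time; no
# intermediate list is ever built.
gen = (x for x in string if x.isupper())
via_gen = ''.join(gen)

assert via_list == via_gen == 'WTF'
```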
Re: After a year using Node.js, the prodigal son returns
On 05/04/2016 02:59 AM, Steven D'Aprano wrote:
> A year ago, Gavin Vickery decided to move away from Python and give
> Javascript with Node.js a try. Twelve months later, he has written about
> his experiences:
>
> http://geekforbrains.com/post/after-a-year-of-nodejs-in-production

Very interesting. Frankly Javascript sounds awful. Even on the front end.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Pylint prefers list comprehension over filter...
On Thu, 05 May 2016 18:37:11 -0700, Stephen Hansen wrote:

> ''.join(x for x in string if x.isupper())

> The difference is, both filter and your list comprehension *build a
> list* which is not needed, and wasteful. The above skips building a
> list, instead returning a generator ...

filter used to build a list, but now it doesn't (where "used to" means
Python 2.7 and "now" means Python 3.5; I'm too lazy to track down the
exact point(s) at which it changed):

Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160409] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filter(lambda x:x+1, [1, 2, 3, 4])
[1, 2, 3, 4]

Python 3.5.1+ (default, Apr 17 2016, 16:14:06)
[GCC 5.3.1 20160409] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> filter(lambda x:x+1, [1, 2, 3, 4])
<filter object at 0x...>
--
https://mail.python.org/mailman/listinfo/python-list
Re: After a year using Node.js, the prodigal son returns
On Fri, May 6, 2016 at 12:49 PM, Michael Torrie wrote:
> On 05/04/2016 02:59 AM, Steven D'Aprano wrote:
>> A year ago, Gavin Vickery decided to move away from Python and give
>> Javascript with Node.js a try. Twelve months later, he has written about
>> his experiences:
>>
>> http://geekforbrains.com/post/after-a-year-of-nodejs-in-production
>
> Very interesting. Frankly Javascript sounds awful. Even on the front end.

https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript

JavaScript is terrible. Really, really bad. And because of that, it has the
potential to sweep the world.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Pylint prefers list comprehension over filter...
On Fri, May 6, 2016 at 12:46 PM, Dan Sommers wrote:
> filter used to build a list, but now it doesn't (where "used to" means
> Python 2.7 and "now" means Python 3.5; I'm too lazy to track down the
> exact point(s) at which it changed):
>
> Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
> [GCC 5.3.1 20160409] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> filter(lambda x:x+1, [1, 2, 3, 4])
> [1, 2, 3, 4]
>
> Python 3.5.1+ (default, Apr 17 2016, 16:14:06)
> [GCC 5.3.1 20160409] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> filter(lambda x:x+1, [1, 2, 3, 4])

Most of these kinds of changes happened in 3.0, where backward-incompatible
changes were accepted. A whole bunch of things stopped returning lists and
started returning lazy iterables - range, filter/map, dict.keys(), etc -
because most of the time, they're iterated over once and then dropped.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
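A sketch of the Python 3 behaviour described above (illustrative only, not
part of the original message):

```python
# In Python 3, filter and map return lazy iterables, not lists.
f = filter(lambda x: x != -1, [0, -1, 3, -1])
m = map(str.upper, "abc")

assert not isinstance(f, list)   # a filter object, not a list
assert list(f) == [0, 3]         # materialize with list() when needed
assert list(m) == ['A', 'B', 'C']

# Lazy iterables are single-use: a second pass yields nothing.
assert list(f) == []
```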
Re: Pylint prefers list comprehension over filter...
On Thu, May 5, 2016, at 07:46 PM, Dan Sommers wrote:
> On Thu, 05 May 2016 18:37:11 -0700, Stephen Hansen wrote:
>
> > ''.join(x for x in string if x.isupper())
>
> > The difference is, both filter and your list comprehension *build a
> > list* which is not needed, and wasteful. The above skips building a
> > list, instead returning a generator ...
>
> filter used to build a list, but now it doesn't (where "used to" means
> Python 2.7 and "now" means Python 3.5; I'm too lazy to track down the
> exact point(s) at which it changed):

Oh, didn't know that. Then again, the OP was converting the output of
filter *into* a list, which wasted a list either way.

--
Stephen Hansen
m e @ i x o k a i . i o
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to become more motivated to learn Python
The best way to increase your motivation to learn Python is:

1. Select a non-trivial problem that you need to solve with programming.
2. Try to write the program you need in any other language (that you don't
   already know well).
3. Write the program you need in Python.
4. Gaze in astonishment at the time that you could have saved by skipping
   step 2.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Pylint prefers list comprehension over filter...
On Fri, 06 May 2016 02:46:22 +0000, Dan Sommers wrote:

> Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
> [GCC 5.3.1 20160409] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> filter(lambda x:x+1, [1, 2, 3, 4])
> [1, 2, 3, 4]
>
> Python 3.5.1+ (default, Apr 17 2016, 16:14:06)
> [GCC 5.3.1 20160409] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> filter(lambda x:x+1, [1, 2, 3, 4])

Muphry's Law strikes again. That lambda function is obviously a leftover
from a call to *map* rather than a call to *filter*, but thanks everyone
for not laughing and pointing.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Pylint prefers list comprehension over filter...
On Fri, May 6, 2016 at 1:07 PM, Dan Sommers wrote:
> On Fri, 06 May 2016 02:46:22 +0000, Dan Sommers wrote:
>
>> Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
>> [GCC 5.3.1 20160409] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> filter(lambda x:x+1, [1, 2, 3, 4])
>> [1, 2, 3, 4]
>>
>> Python 3.5.1+ (default, Apr 17 2016, 16:14:06)
>> [GCC 5.3.1 20160409] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> filter(lambda x:x+1, [1, 2, 3, 4])
>
> Muphry's Law strikes again. That lambda function is obviously a leftover
> from a call to *map* rather than a call to *filter*, but thanks everyone
> for not laughing and pointing.

Hey, maybe you wanted to filter out all the -1 results. Maybe you have a
search function that returns zero-based offsets, or -1 for "not found".
Seems reasonable! And "x+1" is way shorter than "x!=-1", which means by
definition that it's better.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
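For what it's worth, the joke works: for integers, x + 1 is falsy exactly
when x == -1, so the two predicates coincide. A sketch (illustrative only,
using str.find as the "search function that returns -1 for not found"):

```python
# str.find returns a zero-based offset, or -1 when not found.
offsets = [s.find('o') for s in ["foo", "bar", "bop"]]   # [1, -1, 1]

# The explicit, readable predicate:
found = list(filter(lambda x: x != -1, offsets))
assert found == [1, 1]

# For integers, x + 1 is falsy only when x == -1, so the shorter
# (if more cryptic) predicate gives the same result:
assert list(filter(lambda x: x + 1, offsets)) == found
```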
Re: Whittle it on down
Steven D'Aprano writes:

> On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote:
>
>> Random832's pattern is fine. You need to use re.fullmatch with it.
>
> py> re.fullmatch
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'fullmatch'

It's new in version 3.4 (of Python).
--
https://mail.python.org/mailman/listinfo/python-list
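On Pythons older than 3.4, the usual substitute for re.fullmatch is re.match
with a \Z anchor appended to the pattern. A sketch, using the
uppercase-letters-and-ampersands pattern from the thread:

```python
import re

# The thread's character class: uppercase ASCII letters, whitespace, '&'.
pattern = r"[A-Z\s&]+"

def full_match(pat, s):
    # Pre-3.4 substitute for re.fullmatch: anchor the end with \Z.
    return re.match(pat + r"\Z", s)

assert full_match(pattern, "FITNESS CENTERS")
assert full_match(pattern, "fitness centers") is None

# On Python 3.4+, re.fullmatch does the same job directly.
assert re.fullmatch(pattern, "FITNESS CENTERS")
```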
