Re: Fwd: Python for philosophers

2013-05-14 Thread DJC

On 14/05/13 09:34, Citizen Kant wrote:

2013/5/14 Steven D'Aprano 


On Tue, 14 May 2013 01:32:43 +0200, Citizen Kant wrote:


An entity named Python must be somehow as a serpent. Don't forget that
I'm with the freeing up of my memory, now I'm not trying to follow the
path of what's told but acting like the monkey and pushing with my
finger against the skin of the snake.



Python is not named after the snake, but after Monty Python the British
comedy troupe. And they picked their name because it sounded funny.



http://en.wikipedia.org/wiki/Monty_Python





I'm sorry to hear that. Mostly because, as an answer, it seems to exemplify
very well the "taken because I've been told how things are" kind of
action, which is exactly the opposite of the point I'm trying to make.




Immanuel Kant was a real pissant
Who was very rarely stable
--
http://mail.python.org/mailman/listinfo/python-list


Re: how to calculate reputation

2013-07-06 Thread DJC

On 03/07/13 02:05, Mark Janssen wrote:

Hi all, this seems to be quite stupid question but I am "confused"..
We set the initial value to 0, +1 for up-vote and -1 for down-vote! nice.

I have a list of bool values True, False (True for up vote, False for
down-vote).. submitted by users.

should I take True = +1, False=0  [or] True = +1, False=-1 ?? for adding
all.

I am missing something here.. and that's clear.. anyone please help me on
it?


If False is representing a down-vote, like you say, then you have to
incorporate that information, in which case False=-1  ==> a user not
merely ignored another user, but marked him/her down.



You could express the result as 'x out of y users up-voted this' where x 
= total true and y = x + total_false
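
A minimal sketch of both readings, assuming the votes arrive as a plain list of booleans. The names `votes`, `net_score` and `approval` are illustrative and do not come from the thread.

```python
def net_score(votes):
    """Net reputation: +1 for each up-vote (True), -1 for each down-vote (False)."""
    return sum(1 if v else -1 for v in votes)


def approval(votes):
    """Return (x, y): x up-votes out of y votes cast."""
    up = sum(1 for v in votes if v)
    return up, len(votes)


votes = [True, True, False, True, False]
print(net_score(votes))
print("%d out of %d users up-voted this" % approval(votes))
```

The net score loses the number of voters (one up-vote and 51 up against 50 down both give +1), which is why the "x out of y" form may be the more informative of the two.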



Re: Looking for a good introduction to object oriented programming with Python

2012-08-06 Thread DJC

On 06/08/12 02:27, Steven D'Aprano wrote:

On Sun, 05 Aug 2012 19:12:35 -0400, Roy Smith wrote:


Good lord. I'd rather read C++ than UML.  And I can't read C++.


UML is under-rated.  I certainly don't have any love of the 47 different
flavors of diagram, but the basic idea of having a common graphical
language for describing how objects and classes interact is pretty
useful.  Just don't ask me to remember which kind of arrowhead I'm
supposed to use in which situation.



I frequently draw diagrams to understand the relationships between my
classes and the problem I am trying to solve. I almost invariably use one
type of box and one type of arrowhead. Sometimes if I'm bored I draw
doodles on the diagram. If only I could remember to be consistent about
what doodle I draw where, I too could be an UML guru.



Flow Charts redux


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread DJC

On 19/08/12 15:25, Steven D'Aprano wrote:


Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.


Consider the automated recognition of a CAPTCHA. As the chars have to be 
entered by the user on a keyboard, only the most basic charset can be 
used, so the problem of which chars are possible is quite limited.



Re: color coding for numbers

2012-08-21 Thread DJC

On 21/08/12 12:55, Ulrich Eckhardt wrote:

On 21.08.2012 10:38, [email protected] wrote:

what is the best way


Define "best" before asking such questions. ;)




matplotlib.colors

A module for converting numbers or color arguments to RGB or RGBA

RGB and RGBA are sequences of, respectively, 3 or 4 floats in the range 0-1.

This module includes functions and classes for color specification 
conversions, and for mapping numbers to colors in a 1-D array of colors 
called a colormap.


see








using color/shading on a tkinter canvas as a visualization for a
two-dimensional grid of numbers? so far my best idea is to use the
same value for R,G and B (fill = '#xyxyxy'), which gives shades of
gray. if possible i'd like to have a larger number of visually
distinct values.


The basic idea behind this is that you first normalize the values to a
value between zero and one and then use that to look up an according
color in an array. Of course you can also do both in one step or compute
the colors in the array on the fly (like you did), but it helps keeping
things simple at least for a start, and it also allows testing different
approaches separately.

If the number of distinct resulting colors then isn't good enough, it
could be that the array is too small (its size determines the maximum
number of different colors), that the normalization only uses a small
range between zero and one (reducing the number of colors effectively
used), or simply that your screen doesn't support that many different
colors.


 > i've seen visualizations that seem to use some kind
 > of hot-versus-cold color coding. does anybody know how to do this?

The colour-coding is just the way that above mentioned array is filled.
For the hot/cold coding, you could define a dark blue for low values and
a bright red for high values and then simply interpolate the RGB triple
for values in between.

Uli
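
A rough sketch of the normalize-then-interpolate idea described above. The function names, the sample `grid` and the two end colours (a dark blue and a bright red) are assumptions chosen for illustration, not anything specified in the thread.

```python
def normalize(value, lo, hi):
    """Map value from [lo, hi] onto [0.0, 1.0], clamping out-of-range input."""
    if hi == lo:
        return 0.0
    t = (value - lo) / float(hi - lo)
    return max(0.0, min(1.0, t))


def heat_color(t, cold=(0, 0, 139), hot=(255, 0, 0)):
    """Linearly interpolate between two RGB triples; return a '#rrggbb' string."""
    r, g, b = (int(round(c + (h - c) * t)) for c, h in zip(cold, hot))
    return "#%02x%02x%02x" % (r, g, b)


grid = [[0, 5, 10], [15, 20, 25]]
lo = min(min(row) for row in grid)
hi = max(max(row) for row in grid)
colors = [[heat_color(normalize(v, lo, hi)) for v in row] for row in grid]
print(colors[0][0], colors[-1][-1])
```

Each resulting string can be passed as the `fill` option when drawing a cell on a tkinter canvas. Keeping the two steps separate makes it easy to swap in a different colour scheme without touching the normalization.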




Re: A desperate lunge for on-topic-ness

2012-10-21 Thread DJC

On 20/10/12 15:18, Grant Edwards wrote:

On 2012-10-20, Dennis Lee Bieber  wrote:


Strangely, we've gone from 80-character fixed width displays to
who-knows-what (if I drop my font size I can probably get nearly 200
characters across in full-screen mode)...

But at the same time we've gone from 132-character line-printers
using fan-fold 11x17 pages, to office inkjet/laser printers using 8.5x11
paper, defaulting to portrait orientation -- with a 10 character/inch
font, and 1/4" left/right margins, we're back to 80 character limitation



True, but nobody prints source code out on paper do they?


I print source code. Usually when the development has got to a stage 
that the program works but needs a lot of tidying up. It's a lot more 
comfortable than scrolling up and down screen to look through pages from 
the comfort of an armchair. Also I can take the listing to a Café and 
write notes all over it. Sometimes removing the temptation to 
immediately hit the keyboard is a good thing.




sort order for strings of digits

2012-10-31 Thread djc
I learn lots of useful things from the list, some not always welcome. No 
sooner had I found a solution to a minor inconvenience in my code, than 
a recent thread here drew my attention to the fact that it will not work 
for python 3. So suggestions please:


TODO 2012-10-22: sort order numbers first then alphanumeric
>>> n
('1', '10', '101', '3', '40', '31', '13', '2', '2000')
>>> s
('a', 'ab', 'acd', 'bcd', '1a', 'a1', '222 bb', 'b a 4')

>>> sorted(n)
['1', '10', '101', '13', '2', '2000', '3', '31', '40']
>>> sorted(s)
['1a', '222 bb', 'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']
>>> sorted(n+s)
['1', '10', '101', '13', '1a', '2', '2000', '222 bb', '3', '31', '40', 
'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']




Possibly there is a better way but for Python 2.7 this gives the 
required result


Python 2.7.3 (default, Sep 26 2012, 21:51:14)

>>> sorted(int(x) if x.isdigit() else x for x in n+s)
[1, 2, 3, 10, 13, 31, 40, 101, 2000, '1a', '222 bb', 'a', 'a1', 'ab', 
'acd', 'b a 4', 'bcd']



>>> [str(x) for x in sorted(int(x) if x.isdigit() else x for x in n+s)]
['1', '2', '3', '10', '13', '31', '40', '101', '2000', '1a', '222 bb', 
'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']



But not for Python 3
Python 3.2.3 (default, Oct 19 2012, 19:53:16)

>>> sorted(n+s)
['1', '10', '101', '13', '1a', '2', '2000', '222 bb', '3', '31', '40', 
'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']


>>> sorted(int(x) if x.isdigit() else x for x in n+s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() < int()
>>>

The best I can think of is to split the input sequence into two lists, 
sort each and then join them.
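
The split-and-join idea can be sketched as follows for Python 3. The function name `numbers_first` is made up for this example; `n` and `s` are the test tuples shown earlier in the message.

```python
def numbers_first(labels):
    """Sort digit-only strings numerically, then everything else lexically."""
    numeric = sorted((x for x in labels if x.isdigit()), key=int)
    other = sorted(x for x in labels if not x.isdigit())
    return numeric + other


n = ('1', '10', '101', '3', '40', '31', '13', '2', '2000')
s = ('a', 'ab', 'acd', 'bcd', '1a', 'a1', '222 bb', 'b a 4')
print(numbers_first(n + s))
```

Because the numbers are never converted in the output, no `str()` round trip is needed. One caveat: `str.isdigit()` accepts a few characters (superscript digits, for instance) that `int()` rejects, so unusual input may need a stricter test.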



--
djc



Re: sort order for strings of digits

2012-10-31 Thread DJC

On 31/10/12 23:09, Steven D'Aprano wrote:

On Wed, 31 Oct 2012 15:17:14 +0000, djc wrote:


The best I can think of is to split the input sequence into two lists,
sort each and then join them.


According to your example code, you don't have to split the input because
you already have two lists, one filled with numbers and one filled with
strings.


Sorry for the confusion, the pair of strings was just a way of testing 
variations on the input. So a sequence with any combination of strings 
that can be read as numbers and strings of chars that don't look like 
numbers (even if that string includes digits) is the expected input




But I think that what you actually have is a single list of strings, and
you are supposed to sort the strings such that they come in numeric order
first, then alphanumerical. E.g.:

['9', '1000', 'abc2', '55', '1', 'abc', '55a', '1a']
=> ['1', '1a', '9', '55', '55a', '1000', 'abc', 'abc2']


Not quite, what I want is to ensure that if the strings look like 
numbers they are placed in numerical order. ie 1 2 3 10 100 not 1 10 100 
2 3. Cases where a string has some leading digits can be treated as 
strings like any other.
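
One way to express that requirement in Python 3 is a key function that never compares an int with a str. This is only a sketch; `label_key` is a hypothetical name, and the sample list is the one quoted above.

```python
def label_key(s):
    """Digit-only strings sort first, by numeric value; the rest sort as text."""
    if s.isdigit():
        return (0, int(s), "")
    return (1, 0, s)


labels = ['9', '1000', 'abc2', '55', '1', 'abc', '55a', '1a']
print(sorted(labels, key=label_key))
```

The leading 0 or 1 in each tuple decides which group a label belongs to, so the second and third elements are only ever compared within the same group.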



At least that is what I would expect as the useful thing to do when
sorting.


Well it depends on the use case. In my case the strings are column and 
row labels for a report. I want them to be presented in a convenient to 
read sequence. Which the lexical sorting of the strings that look like 
numbers is not. I want a reasonable do-what-i-mean default sort order 
that can handle whatever strings are used.





The trick is to take each string and split it into a leading number and a
trailing alphanumeric string. Either part may be "empty". Here's a pure
Python solution:

from sys import maxsize  # use maxint in Python 2
def split(s):
    for i, c in enumerate(s):
        if not c.isdigit():
            break
    else:  # aligned with the FOR, not the IF
        return (int(s), '')
    return (int(s[:i] or maxsize), s[i:])

Now sort using this as a key function:

py> L = ['9', '1000', 'abc2', '55', '1', 'abc', '55a', '1a']
py> sorted(L, key=split)
['1', '1a', '9', '55', '55a', '1000', 'abc', 'abc2']


The above solution is not quite general:

* it doesn't handle negative numbers or numbers with a decimal point;

* it doesn't handle the empty string in any meaningful way;

* in practice, you may or may not want to ignore leading whitespace,
   or trailing whitespace after the number part;

* there's a subtle bug if a string contains a very large numeric prefix,
   finding and fixing that is left as an exercise.


That looks more than  general enough for my purposes! I will experiment 
along those lines, thank you.





Re: Good use for itertools.dropwhile and itertools.takewhile

2012-12-04 Thread DJC

On 04/12/12 17:18, Alexander Blinne wrote:

Another neat solution with a little help from

http://stackoverflow.com/questions/1701211/python-return-the-index-of-the-first-element-of-a-list-which-makes-a-passed-fun


def split_product(p):

 w = p.split(" ")
 j = (i for i,v in enumerate(w) if v.upper() != v).next()
 return " ".join(w[:j]), " ".join(w[j:])


Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> w1 = "CAPSICUM RED Fresh from Queensland"
>>> w1.split()
['CAPSICUM', 'RED', 'Fresh', 'from', 'Queensland']
>>> w = w1.split()

>>> (i for i,v in enumerate(w) if v.upper() != v)
<generator object <genexpr> at 0x18b1910>
>>> (i for i,v in enumerate(w) if v.upper() != v).next()
2

Python 3.2.3 (default, Oct 19 2012, 19:53:16)

>>> (i for i,v in enumerate(w) if v.upper() != v).next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'generator' object has no attribute 'next'
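
The generator method was renamed to `__next__` in Python 3; the built-in `next()` function (available since Python 2.6) works on both. A sketch of the same function written that way follows. The `len(w)` default, used when no mixed-case word is found, is an addition for this example rather than part of the original suggestion.

```python
w = "CAPSICUM RED Fresh from Queensland".split()

# Built-in next() instead of the Python 2 only .next() method.
j = next(i for i, v in enumerate(w) if v.upper() != v)
print(j)


def split_product(p):
    w = p.split(" ")
    # Fall back to len(w) so an all-capitals string does not raise StopIteration.
    j = next((i for i, v in enumerate(w) if v.upper() != v), len(w))
    return " ".join(w[:j]), " ".join(w[j:])


print(split_product("CAPSICUM RED Fresh from Queensland"))
```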



Re: got stuck in equation

2013-01-02 Thread DJC

On 02/01/13 05:20, Ramchandra Apte wrote:

On Tuesday, 1 January 2013 22:55:21 UTC+5:30, Usama Khan  wrote:

am just a begginer bro. . jus learnt if elif while nd for loop. .
can u just display me the coding u want. .it could save my time that i have 
while searchning SN out. . i will give u that dependable variables values. .
now can u give me the coding of this equation as i need to save my time. .. .i 
am learning from tutorials. . so its taking lot of time. kindly consider my 
request nd give me the code so that i can put it. .i dont want to losen up my 
grade. .am trying hard. . .



If you spend half a minute extra to write your sentences properly and not use 
SMS abbreviations, then it would save totally many minutes of other people's 
time.


+1 not to mention that a polite, respectful, and well written request 
might incline more people to offer help.




Re: Thought of the day

2013-01-15 Thread DJC

On 15/01/13 16:48, Antoine Pitrou wrote:

Steven D'Aprano  pearwood.info> writes:


A programmer had a problem, and thought Now he has "I know, I'll solve
two it with threads!" problems.



Host: Last week the Royal Festival Hall saw the first performance of a new
logfile by one of the world's leading modern programmers, Steven
"Two threads" D'Aprano. Mr D'Aprano.

D'Aprano: Hello.

Host: May I just sidetrack for one moment. This -- what shall I call it --
nickname of yours...

D'Aprano: Ah yes.

Host: "Two threads". How did you come by it?

[...]

Host: I see, I see. And you're thinking of spawning this second thread to
write in!

D'Aprano: No, no. Look. This thread business -- it doesn't really matter.
The threads aren't important. A few friends call me Two Threads and that's
all there is to it. I wish you'd ask me about the logfile. Everybody talks
about the threads. They've got it out of proportion -- I'm a programmer.
I'm going to get rid of the thread. I'm fed up with it!

Host: Then you'll be Steven "No Threads" D'Aprano, eh?



+ Applause



Re: Learning Python 2.4

2011-12-21 Thread DJC
On 21/12/11 02:13, Ashton Fagg wrote:

> I got the impression the OP was learning programming in general (i.e.
> from scratch) and not merely "learning Python". If this is the case it
> shouldn't matter if they're merely learning the concepts as you can
> always get up to speed on the differences later on as they get more
> experienced.

In which case the most important thing is the quality of the book as a
text on Programming. If you find the author's style to your taste,
then use that book rather than struggle with a text based on a recent
version that you personally find unreadable.


Re: Why are timezone aware and naive datetimes not distinct classes?

2013-03-11 Thread djc

On 11/03/13 17:27, Adam Tauno Williams wrote:


Because date/time management in Python is *@*&@R&*(R *@&Y terrible!
Period, full-stop, awful, crappy, lousy, and aggravating.  The design is
haphazard and error inducing.


+1


--
djc



Re: Confusing Algorithm

2013-04-22 Thread DJC

On 22/04/13 13:39, RBotha wrote:

I'm facing the following problem:

"""
In a city of towerblocks, Spiderman can
“cover” all the towers by connecting the
first tower with a spider-thread to the top
of a later tower and then to a next tower
and then to yet another tower until he
reaches the end of the city. Threads are
straight lines and cannot intersect towers.
Your task is to write a program that finds
the minimal number of threads to cover all
the towers. The list of towers is given as a
list of single digits indicating their height.

-Example:
List of towers: 1 5 3 7 2 5 2
Output: 4
"""

I'm not sure how a 'towerblock' could be defined. How square does a shape have 
to be to qualify as a towerblock? Any help on solving this problem?


It's not the algorithm that's confusing, it's the problem. First clarify 
the problem.
This appears to be a variation of the travelling-salesman problem. 
Except the position of the towers is not defined, only their height.
So either the necessary information is missing or whoever set the 
problem intended something else.




simple way to un-nest (flatten?) list

2006-11-05 Thread djc
There is I am sure an easy way to do this, but I seem to be brain dead 
tonight. So:

I have a table such that I can do

  [line for line  in table if line[7]=='JDOC']
and
  [line for line  in table if line[7]=='Aslib']
and
  [line for line  in table if line[7]=='ASLIB']
etc

I also have a dictionary
  r=  {'a':('ASLIB','Aslib'),'j':('JDOC', 'jdoc')}
so I can extract values
r.values()
[('ASLIB', 'Aslib'), ('JDOC', 'jdoc')]

I would like to do

[line for line  in table if line[7] in ('JDOC','jdoc','Aslib','ASLIB')]

so how should I get from
{'a':('ASLIB','Aslib'),'j':('JDOC','jdoc')}
to
('Aslib','ASLIB','JDOC','jdoc')
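
For reference, one possible way to do this in current Python 3 is `itertools.chain.from_iterable`. The three-row `table` below is invented sample data, assumed only to have the label in column 7 as in the question.

```python
from itertools import chain

r = {'a': ('ASLIB', 'Aslib'), 'j': ('JDOC', 'jdoc')}

# Flatten the tuples held as dictionary values into a single tuple.
names = tuple(chain.from_iterable(r.values()))
print(names)

# Hypothetical rows: seven filler columns, then the label in column 7.
table = [
    ['x'] * 7 + ['JDOC'],
    ['x'] * 7 + ['other'],
    ['x'] * 7 + ['Aslib'],
]
selected = [line for line in table if line[7] in names]
print(len(selected))
```

If order does not matter, wrapping the result in `set()` instead of `tuple()` makes the membership test faster for large tables.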



-- 
djc


Re: simple way to un-nest (flatten?) list

2006-11-06 Thread djc
George Sakkis wrote:
 > Meet itertools:
 >
 > from itertools import chain
 > names = set(chain(*r.itervalues()))
 > print [line for line in table if line[7] in names]

Steven D'Aprano wrote:
> Assuming you don't care what order the strings are in:
> 
> r = {'a':('ASLIB','Aslib'),'j':('JDOC','jdoc')}
> result = sum(r.values(), ())
> 
> If you do care about the order:
> 
> r = {'a':('ASLIB','Aslib'),'j':('JDOC','jdoc')}
> keys = r.keys()
> keys.sort()
> result = []
> for key in keys:
>     result.extend(r[key])
> result = tuple(result)

Thank you everybody.
As it is possible that the tuples will not always be the same word in 
variant cases
result = sum(r.values(), ())
  will do fine and is as simple as I suspected the answer would be.





-- 
djc


Re: simple way to un-nest (flatten?) list

2006-11-07 Thread djc
[EMAIL PROTECTED] wrote:
> It is simple, but I suggest you to take a look at the speed of that
> part of your code into your program. With this you can see the
> difference:
> 
> from time import clock
> d = dict((i,range(300)) for i in xrange(300))
> 
> t = clock()
> r1 = sum(d.values(), [])
> print clock() - t
> 
> t = clock()
> r2 = []
> for v in d.values(): r2.extend(v)
> print clock() - t

Yes, interesting, and well worth noting

1   for v in d.values(): r1.extend(v)

2   from itertools import chain
 set(chain(*d.itervalues()))

3   set(v for t in d.values() for v in t)

4   sum(d.values(), [])

5   reduce((lambda l,v: l+v), d.values())

  on IBM R60e [CoreDuo 1.6GHz/2GB]
  d = dict((i,range(x)) for i in xrange(x))
     x    t1    t2    t3      t4      t5
   300  0.0   0.02  0.04    0.31    0.32
   500  0.01  0.09  0.1     1.67    1.69
  1000  0.02  0.3   0.4    16.17   16.15
        0.03  0.28  0.42   16.37   16.31
  1500  0.03  0.76  0.94   57.05   57.13
  2000  0.07  1.2   1.66  136.6   136.97
  2500  0.11  2.34  2.64  268.44  268.85
but on the other hand, as the intended application is a small command 
line app where x is unlikely to reach double figures and there are only 
two users, myself included:
d = 
{'a':['ASLIB','Aslib'],'j':['JDOC','jdoc'],'x':['test','alt','3rd'],'y':['single',]}
0.0 0.0 0.0 0.0 0.0

And sum(d.values(), []) has the advantage of raising a TypeError in the 
case of a possible mangled input.

{'a':['ASLIB','Aslib'],'j':['JDOC','jdoc'],'x':['test','alt','3rd'],'y':'single'}
  r1
['ASLIB', 'Aslib', 'test', 'alt', '3rd', 'JDOC', 'jdoc', 's', 'i', 'n', 
'g', 'l', 'e']
r2
set(['Aslib', 'JDOC', 'g', '3rd', 'i', 'l', 'n', 'ASLIB', 's', 'test', 
'jdoc', 'alt', 'e'])
r4 = sum(d.values(), [])
TypeError: can only concatenate list (not "str") to list
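
The difference in failure behaviour can be reproduced with a short Python 3 sketch. The dictionary is a cut-down version of the mangled input above, and the `raised` flag exists only to make the outcome easy to inspect.

```python
d = {'a': ['ASLIB', 'Aslib'], 'y': 'single'}   # 'y' should have been ['single']

# extend() accepts any iterable, so the string is silently split into characters.
r1 = []
for v in d.values():
    r1.extend(v)
print(r1)

# sum() uses list + value, which refuses to concatenate a str to a list.
raised = False
try:
    r4 = sum(d.values(), [])
except TypeError as exc:
    raised = True
    print("rejected:", exc)
```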



-- 
djc


Re: Geohashing

2009-04-29 Thread djc
Raymond Hettinger wrote:
> import hashlib
> 
> def geohash(latitude, longitude, datedow):
>     '''Compute geohash() in http://xkcd.com/426/
> 
>     >>> geohash(37.421542, -122.085589, b'2005-05-26-10458.68')
>     37.857713 -122.544543
> 
>     '''
>     h = hashlib.md5(datedow).hexdigest()
>     p, q = [('%f' % float.fromhex('0.' + x)) for x in (h[:16], h[16:32])]
>     print('%d%s %d%s' % (latitude, p[1:], longitude, q[1:]))
> 
> if __name__ == '__main__':
>     import doctest
>     doctest.testmod()

Python 2.5.2 (r252:60911, Oct  5 2008, 19:29:17)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>>
>>> def geohash(latitude, longitude, datedow):
...     '''Compute geohash() in http://xkcd.com/426/
...
...     >>> geohash(37.421542, -122.085589, b'2005-05-26-10458.68')
...     37.857713 -122.544543
...
...     '''
...     h = hashlib.md5(datedow).hexdigest()
...     p, q = [('%f' % float.fromhex('0.' + x)) for x in (h[:16], h[16:32])]
...     print('%d%s %d%s' % (latitude, p[1:], longitude, q[1:]))
...
>>> if __name__ == '__main__':
...     import doctest
...     doctest.testmod()
...
...
**
File "__main__", line 4, in __main__.geohash
Failed example:
geohash(37.421542, -122.085589, b'2005-05-26-10458.68')
Exception raised:
Traceback (most recent call last):
  File "/usr/lib/python2.5/doctest.py", line 1228, in __run
compileflags, 1) in test.globs
  File "", line 1
 geohash(37.421542, -122.085589, b'2005-05-26-10458.68')
  ^
 SyntaxError: invalid syntax
**
1 items had failures:
   1 of   1 in __main__.geohash
***Test Failed*** 1 failures.
(1, 1)
>>>
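
The SyntaxError comes from the `b'...'` literal, which Python 2.5 does not accept; bytes literals and `float.fromhex` both arrived in Python 2.6. For comparison, here is a sketch of the same function for Python 3. Returning the string instead of printing it is a change made for this example so the result is easy to check.

```python
import hashlib


def geohash(latitude, longitude, datedow):
    """Return the xkcd 426 geohash string; datedow is bytes, e.g. b'date-DowOpen'."""
    h = hashlib.md5(datedow).hexdigest()
    # Each 16-hex-digit half of the digest becomes a fraction in [0, 1).
    p, q = ['%f' % float.fromhex('0.' + x) for x in (h[:16], h[16:32])]
    # p and q look like '0.xxxxxx'; drop the leading zero and append to the graticule.
    return '%d%s %d%s' % (latitude, p[1:], longitude, q[1:])


print(geohash(37.421542, -122.085589, b'2005-05-26-10458.68'))
```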


-- 
djc @work


Re: Iterate over group names in a regex match?

2010-01-19 Thread djc
Brian D wrote:
> Here's a simple named group matching pattern:
> 
> >>> s = "1,2,3"
> >>> p = re.compile(r"(?P<one>\d),(?P<two>\d),(?P<three>\d)")
> >>> m = re.match(p, s)
> >>> m
> <_sre.SRE_Match object at 0x011BE610>
> >>> print m.groups()
> ('1', '2', '3')
> 
> Is it possible to call the group names, so that I can iterate over
> them?
> 
> The result I'm looking for would be:
> 
> ('one', 'two', 'three')


>>> print(m.groupdict())
{'one': '1', 'three': '3', 'two': '2'}

>>> print(m.groupdict().keys())
['one', 'three', 'two']
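
The dictionary keys above come back in arbitrary order on that Python version. If the names are wanted in the order the groups appear in the pattern, the compiled pattern's `groupindex` mapping (group name to group number) can be used. A small sketch, with the group names taken from the expected result in the question:

```python
import re

p = re.compile(r"(?P<one>\d),(?P<two>\d),(?P<three>\d)")
m = p.match("1,2,3")

# Sort the group names by their group number to recover pattern order.
names = tuple(sorted(p.groupindex, key=p.groupindex.get))
print(names)

for name in names:
    print(name, m.group(name))
```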


-- 
David Clark, MSc, PhD.  Dept of Information Studies
Systems & Web Development Manager   University  College  London
UCL Centre for Publishing   Gower Str  London  WCIE 6BT


should writing Unicode files be so slow

2010-03-18 Thread djc
I have a simple program to read a text (.csv) file and split it into
several smaller files. Tonight I decided to write a unicode variant and was
surprised at the difference in performance. Is there a better way?

> from __future__ import with_statement
> import codecs
> 
> def _rowreader(filename, separator='\t'):
>     """Generator for iteration over potentially large file."""
>     with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>         for row in tabfile:
>             yield [v.strip() for v in row.split(separator)]
> 
> def generator_of_output(source_of_lines):
>     for line in source_of_lines:
>         for result in some_function(line):
>             yield result
> 
> def coroutine(outfile_prefix, outfile_suffix, sep='\t'):
>     outfile = '%s_%s.txt' % (outfile_prefix, outfile_suffix)
>     with codecs.open(outfile, 'w', 'utf-8') as out_part:
>         while True:
>             line = (yield)
>             out_part.write(sep.join(line) + '\n')
> 
> def _file_to_files(infile, outfile_prefix, column, sep):
>     column_values = dict()
>     for line in _rowreader(infile, sep):
>         outfile_suffix = line[column].strip('\'\"')
>         if outfile_suffix in column_values:
>             column_values[outfile_suffix].send(line)
>         else:
>             file_writer = coroutine(outfile_prefix, outfile_suffix, sep)
>             file_writer.next()
>             file_writer.send(line)
>             column_values[outfile_suffix] = file_writer
>     for file_writer in column_values.itervalues():
>         file_writer.close()

the plain version is the same except for
> with open(filename, 'rU') as tabfile:
> with open(outfile, 'wt') as out_part:


The difference:
> "uid","timestamp","taskid","inputid","value"
> "15473178739336026589","2010-02-18T20:50:15+","11696870405","73827093507","83523277829"
> "15473178739336026589","2010-02-18T20:50:15+","11696870405","11800677379","12192844803"
> "15473178739336026589","2010-02-18T20:50:15+","11696870405","31231839235","52725552133"
> 
> sys...@bembo:~/UCLC/bbc/wb2$ wc -l wb.csv
> 9293271 wb.csv
> 
> normal version
> sys...@bembo:~/UCLC$ time ~/UCL/toolkit/file_splitter.py -o tt --separator 
> comma -k 2 wb.csv
> 
> real  0m43.714s
> user  0m37.370s
> sys   0m2.732s
> 
> unicode version
> sys...@bembo:~/UCLC$ time ./file_splitter.py -o t --separator comma -k 2 
> wb.csv
> 
> real  4m8.695s
> user  3m19.236s
> sys   0m39.262s



-- 
David Clark, MSc, PhD.  UCL Centre for Publishing
Gower Str London WCIE 6BT
What sort of web animal are you?



Re: should writing Unicode files be so slow

2010-03-19 Thread djc
Ben Finney wrote:
> djc  writes:
> 
>> I have a simple program to read a text (.csv) file
> 
> Could you please:
> 
> * simplify it further: make a minimal version that demonstrates the
>   difference you're seeing, without any extraneous stuff that doesn't
>   appear to affect the result.
> 
> * make it complete: the code you've shown doesn't do anything except
>   define some functions.
> 
> In other words: please reduce it to a complete, minimal example that we
> can run to see the same behaviour you're seeing.
> 


It is the minimal example. The only thing omitted is the optparse code that
 calls _file_to_files(infile, outfile_prefix, column, sep):


-- 
David Clark, MSc, PhD.  UCL Centre for Publishing
Gower Str London WCIE 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbehaviour>


Re: should writing Unicode files be so slow

2010-03-19 Thread djc
Ben Finney wrote:

> What happens, then, when you make a smaller program that deals with only
> one file?
> 
> What happens when you make a smaller program that only reads the file,
> and doesn't write any? Or a different program that only writes a file,
> and doesn't read any?
> 
> It's these sort of reductions that will help narrow down exactly what
> the problem is. Do make sure that each example is also complete (i.e.
> can be run as is by someone who uses only that code with no additions).
> 


The program reads one CSV file of 9,293,271 lines.
869M wb.csv
It creates a set of files containing the same lines, where each output
file contains only those lines in which the value of a particular column
is the same. The number of output files depends on the number of distinct
values in that column. In the example that results in 19 files:

74M tt_11696870405.txt
94M tt_18762175493.txt
15M  tt_28668070915.txt
12M tt_28673313795.txt
15M  tt_28678556675.txt
11M  tt_28683799555.txt
12M  tt_28689042435.txt
15M  tt_28694285315.txt
7.3M  tt_28835845125.txt
6.8M tt_28842136581.txt
12M  tt_28848428037.txt
11M  tt_28853670917.txt
12M  tt_28858913797.txt
15M  tt_28864156677.txt
11M  tt_28869399557.txt
11M  tt_28874642437.txt
283M  tt_31002203141.txt
259M  tt_5282691.txt
45 2010-03-19 17:00 tt_taskid.txt

changing
with open(filename, 'rU') as tabfile:
to
with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:

and
with open(outfile, 'wt') as out_part:
to
with codecs.open(outfile, 'w', 'utf-8') as out_part:

causes a program that runs in
43 seconds to take 4 minutes to process the same data. In this particular
case that is not very important: any unicode strings in the data are not
worth troubling over, and I have already spent more time satisfying
curiosity than will ever be required to process the dataset in
future. But I have another project in hand where not only is the
unicode significant but the files are very much larger. Scale up the
problem and the difference between 4 hours and 24 becomes a matter worth
some attention.



-- 
David Clark, MSc, PhD.  UCL Centre for Publishing
Gower Str London WCIE 6BT
What sort of web animal are you?



Re: should writing Unicode files be so slow

2010-03-21 Thread djc
Antoine Pitrou wrote:
> On Fri, 19 Mar 2010 17:18:17 +0000, djc wrote:
>> changing
>> with open(filename, 'rU') as tabfile: to
>> with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as
>> tabfile:
>>
>> and
>> with open(outfile, 'wt') as out_part: to
>> with codecs.open(outfile, 'w', 'utf-8') as out_part:
>>
>> causes a program that runs  in
>> 43 seconds to take 4 minutes to process the same data.
> 
> codecs.open() (and the object it returns) is slow as it is written in 
> pure Python.
> 
> Accelerated reading and writing of unicode files is available in Python 
> 2.7 and 3.1, using the new `io` module.

Thank you, for a clear and to the point explanation. I shall concentrate on
finding an optimal time to upgrade from Python 2.6.
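
A sketch of what the splitter might look like with `io.open` (which is the built-in `open` in Python 3). It is a simplified stand-in for the coroutine version, not the original program: the function name, the sample data and the use of `errors='replace'` on input are all assumptions made for this example.

```python
import io
import os
import tempfile


def split_by_column(infile, outfile_prefix, column, sep=','):
    """Copy each row of infile to a file chosen by the value in `column`."""
    writers = {}
    try:
        with io.open(infile, 'r', encoding='utf-8', errors='replace') as src:
            for row in src:
                fields = [v.strip() for v in row.rstrip('\n').split(sep)]
                key = fields[column].strip('\'"')
                if key not in writers:
                    name = '%s_%s.txt' % (outfile_prefix, key)
                    writers[key] = io.open(name, 'w', encoding='utf-8')
                writers[key].write(sep.join(fields) + u'\n')
    finally:
        # Close every output file, even if reading failed part-way.
        for out in writers.values():
            out.close()
    return sorted(writers)


# Tiny made-up input: three rows, two distinct values in column 2.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, 'wb.csv')
with io.open(sample, 'w', encoding='utf-8') as f:
    f.write(u'"u1","t1","100"\n"u2","t2","200"\n"u3","t3","100"\n')

keys = split_by_column(sample, os.path.join(workdir, 'tt'), 2)
print(keys)
```

Like the original, this splits naively on the separator and so does not cope with separators inside quoted fields; the `csv` module would be the usual choice if that matters.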


-- 
David Clark, MSc, PhD.  UCL Centre for Publishing
Gower Str London WCIE 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbehaviour>


Re: OT: Meaning of "monkey"

2010-03-29 Thread djc
Mensanator wrote:
> On Mar 26, 2:44 pm, Phlip  wrote:
>> On Mar 26, 6:14 am, Luis M. González  wrote:
>>
>>> Webmonkey, Greasemonkey, monkey-patching, Tracemonkey, Jägermonkey,
>>> Spidermonkey, Mono (monkey in spanish), codemonkey, etc, etc, etc...
>>> Monkeys everywhere.
>>> Sorry for the off topic question, but what does "monkey" mean in a
>>> nerdy-geek context??
>>> Luis
>> Better at typing than thinking.
> 
> Really? I thought it was more of a reference to Eddington, i.e., given
> enough time even a monkey can type out a program.


Precisely, given infinite typing and zero thinking...

Note also the expression 'talk to the organ  grinder not the monkey'

and 'a trained monkey could do it'

and then there are monkey wrenches, and monkey bikes...

and never call the Librarian a monkey


-- 
David Clark, MSc, PhD.  UCL Centre for Publishing
Gower Str London WCIE 6BT
What sort of web animal are you?
