Re: [Tutor] name shortening in a csv module output

2015-04-25 Thread Steven D'Aprano
On Fri, Apr 24, 2015 at 01:04:57PM +0200, Laura Creighton wrote: > In a message of Fri, 24 Apr 2015 12:46:20 +1000, "Steven D'Aprano" writes: > >The Japanese, Chinese and Korean > >governments, as well as linguists, are all in agreement that despite a > >few minor differences, the three languages

Re: [Tutor] name shortening in a csv module output

2015-04-25 Thread Jim Mooney
> > > I wouldn't use utf-8-sig for output, however, as it puts the BOM in the > file for others to trip over. > > -- > DaveA Yeah, I found that out when I altered the aliases.py dictionary and added 'ubom' : 'utf_8_sig' as an item. Encoding didn't work out so good, but decoding was fine ;')
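The asymmetry described above (utf-8-sig being fine for decoding but awkward for encoding) can be sketched as follows. This is only an illustration; the sample text is made up, and nothing here is taken from the posters' actual files.

```python
import codecs

text = "First Name|Last Name"  # made-up sample text

# Encoding with utf-8-sig prepends the three-byte UTF-8 signature (BOM).
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

print(with_bom.startswith(codecs.BOM_UTF8))
print(without_bom.startswith(codecs.BOM_UTF8))

# Decoding with utf-8-sig accepts both forms and drops the BOM if present.
decoded_a = with_bom.decode("utf-8-sig")
decoded_b = without_bom.decode("utf-8-sig")

# Decoding BOM-prefixed bytes as plain utf-8 keeps U+FEFF as the first
# character, which is what a later reader of the file would trip over.
decoded_plain = with_bom.decode("utf-8")
print(repr(decoded_plain[0]))
```

So reading with utf-8-sig is harmless either way, while writing with it changes the bytes that every later reader has to cope with.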

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Steven D'Aprano
On Fri, Apr 24, 2015 at 04:34:19PM -0700, Jim Mooney wrote: > I was looking things up and although there are aliases for utf_8 (utf8 and > utf-8) I see no aliases for utf_8_sig, so I'm surprised the utf-8-sig I > tried using, worked at all. Actually, I was trying to find the file where > the alias
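One likely reason utf-8-sig works without an explicit alias is that codec names are normalized before lookup, so hyphens and underscores end up equivalent. The sketch below only checks that behaviour from the outside; it does not show where the aliases live, and the list of spellings is chosen for illustration.

```python
import codecs
import encodings

spellings = ["utf-8-sig", "utf_8_sig", "UTF-8-SIG"]

# normalize_encoding() collapses punctuation such as "-" into "_".
normalized = [encodings.normalize_encoding(s) for s in spellings]
print(normalized)

# codecs.lookup() resolves every spelling to the same codec.
codec_names = {codecs.lookup(s).name for s in spellings}
print(codec_names)
```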

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Dave Angel
On 04/24/2015 07:34 PM, Jim Mooney wrote: Apparently so. It looks like utf_8-sig just ignores the sig if it is present, and uses UTF-8 whether the signature is present or not. That surprises me. -- Steve I was looking things up and although there are aliases for utf_8 (utf8 and

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Jim Mooney
> > Apparently so. It looks like utf_8-sig just ignores the sig if it is > present, and uses UTF-8 whether the signature is present or not. > > That surprises me. > > -- > Steve > > I was looking things up and although there are aliases for utf_8 (utf8 and utf-8) I see no aliases for

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Laura Creighton
In a message of Fri, 24 Apr 2015 12:46:20 +1000, "Steven D'Aprano" writes: >The Japanese, Chinese and Korean >governments, as well as linguists, are all in agreement that despite a >few minor differences, the three languages share a common character set. I don't think that is quite the way to sa

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Alan Gauld
On 24/04/15 09:54, Alan Gauld wrote: numbers or other symbols so there were two sets of meanings to each pattern and a shift pattern to switch between them (which is why we have SHIFT keys on modern keyboards). Sorry, I'm conflating two sets of issues here. The SHIFT key pre-dated teleprinters

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Steven D'Aprano
The quoting seems to be all mangled here, so please excuse me if I misattribute quotes to the wrong person: On Thu, Apr 23, 2015 at 04:15:39PM -0700, Jim Mooney wrote: > So is there any way to sniff the encoding, including the BOM (which appears > to be used or not used randomly for utf-8), so y

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Alan Gauld
On 24/04/15 03:46, Steven D'Aprano wrote: Early text encodings all worked in a single byte which is limited to 256 patterns. Oh it's much more complicated than that! Note I said *in* a single byte, ie they were all 8 bits or less. *seven bits*, not even a full byte. It was seven bits so th

Re: [Tutor] name shortening in a csv module output

2015-04-24 Thread Jim Mooney
So is there any way to sniff the encoding, including the BOM (which appears to be used or not used randomly for utf-8), so you can then use the proper encoding, or do you wander in the wilderness? Pretty much guesswork. > Alan Gauld -- This all sounds suspiciously like the old browser wars I suf

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Steven D'Aprano
On Fri, Apr 24, 2015 at 12:33:57AM +0100, Alan Gauld wrote: > On 24/04/15 00:15, Jim Mooney wrote: > >Pretty much guesswork. > >Alan Gauld > >-- > >This all sounds suspiciously like the old browser wars > > It's more about history. Early text encodings all worked in a single byte > which is > lim

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Steven D'Aprano
On Thu, Apr 23, 2015 at 05:40:34PM -0400, Dave Angel wrote: > On 04/23/2015 05:08 PM, Mark Lawrence wrote: > > > >Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :) > > > > As I recall, it stands for "Byte Order Mark". Applicable only to > multi-byte storage formats (eg. UTF

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Steven D'Aprano
On Thu, Apr 23, 2015 at 10:08:05PM +0100, Mark Lawrence wrote: > Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :) :-) I'm not sure if you mean that as a serious question or not. BOM stands for Byte Order Mark, and it is needed for UTF-16 and UTF-32 encodings because the
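A small sketch of why byte order matters for UTF-16: the same text produces different bytes in little-endian and big-endian form, and the BOM is what lets the generic "utf-16" decoder pick the right one. The sample string is arbitrary.

```python
import codecs

text = "hi"  # arbitrary sample

# Same characters, two byte orders.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")
print(le_bytes, be_bytes)

# With a BOM in front, the plain "utf-16" decoder works out the byte order
# on its own and strips the mark from the result.
decoded_le = (codecs.BOM_UTF16_LE + le_bytes).decode("utf-16")
decoded_be = (codecs.BOM_UTF16_BE + be_bytes).decode("utf-16")
print(decoded_le, decoded_be)
```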

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Steven D'Aprano
On Wed, Apr 22, 2015 at 10:18:31PM -0700, Jim Mooney wrote: > My result: > > Ï»¿First NameLast Name # odd characters on header line Any time you see "odd characters" in text like that, you should immediately think "encoding problem". These odd characters are normally called m
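The "odd characters" effect can be reproduced in a few lines, assuming (as suggested elsewhere in the thread) that the file is UTF-8 with a BOM and is being read as Latin 1. The header text is taken from the output quoted above; everything else is illustrative.

```python
import codecs

header = "First Name"

# Bytes as they would sit on disk: UTF-8 signature followed by the text.
raw = codecs.BOM_UTF8 + header.encode("utf-8")

# Reading those bytes with the wrong codec turns the BOM into three
# unrelated characters at the start of the line.
misread = raw.decode("latin-1")
junk = misread[: len(misread) - len(header)]
print(len(junk), repr(junk))

# Latin 1 maps every byte to one character, so the damage can be undone
# by re-encoding and decoding with the right codec.
repaired = misread.encode("latin-1").decode("utf-8-sig")
print(repaired)
```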

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Alan Gauld
On 24/04/15 00:15, Jim Mooney wrote: Pretty much guesswork. Alan Gauld -- This all sounds suspiciously like the old browser wars It's more about history. Early text encodings all worked in a single byte which is limited to 256 patterns. That's simply not enough to cover all the alphabets aroun

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Dave Angel
On 04/23/2015 05:08 PM, Mark Lawrence wrote: Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :) As I recall, it stands for "Byte Order Mark". Applicable only to multi-byte storage formats (e.g. UTF-16), it lets the reader decide which of the formats was used. For exa

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Dave Angel
On 04/23/2015 02:14 PM, Jim Mooney wrote: By relying on the default when you read it, you're making an unspoken assumption about the encoding of the file. -- DaveA So is there any way to sniff the encoding, including the BOM (which appears to be used or not used randomly for utf-8), so you c

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Alan Gauld
On 23/04/15 19:14, Jim Mooney wrote: By relying on the default when you read it, you're making an unspoken assumption about the encoding of the file. So is there any way to sniff the encoding, including the BOM (which appears to be used or not used randomly for utf-8), so you can then use th
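For the narrow case where a BOM is actually present, sniffing is straightforward. The helper below is a hypothetical sketch (sniff_bom is not a standard library function), and it deliberately falls back to a caller-supplied guess when there is no BOM, since that case remains guesswork.

```python
import codecs

# Longer marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, else the default guess."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


for sample in (codecs.BOM_UTF8 + b"abc", "abc".encode("utf-16"), b"abc"):
    codec = sniff_bom(sample)
    print(codec, sample.decode(codec))
```

The codec names returned are the BOM-aware ones, so decoding with the result also removes the mark.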

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Mark Lawrence
On 23/04/2015 19:14, Jim Mooney wrote: By relying on the default when you read it, you're making an unspoken assumption about the encoding of the file. -- DaveA So is there any way to sniff the encoding, including the BOM (which appears to be used or not used randomly for utf-8), so you can

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Jim Mooney
> > By relying on the default when you read it, you're making an unspoken > assumption about the encoding of the file. > > -- > DaveA So is there any way to sniff the encoding, including the BOM (which appears to be used or not used randomly for utf-8), so you can then use the proper encoding, or

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Dave Angel
On 04/23/2015 06:37 AM, Jim Mooney wrote: .. Ï»¿ is the UTF-8 BOM (byte order mark) interpreted as Latin 1. If the input is UTF-8 you can get rid of the BOM with with open("data.txt", encoding="utf-8-sig") as csvfile: Peter Otten I caught the bad arithmetic on name length, but where is t

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Peter Otten
Jim Mooney wrote: > .. > >> Ï»¿ >> >> is the UTF-8 BOM (byte order mark) interpreted as Latin 1. >> >> If the input is UTF-8 you can get rid of the BOM with >> >> with open("data.txt", encoding="utf-8-sig") as csvfile: >> > > Peter Otten > > I caught the bad arithmetic on name length, but where
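Putting the suggestion together with the csv module might look like the sketch below. The "|" delimiter is inferred from the sample line quoted in this thread, and the file contents and temporary file are made up so the example is self-contained.

```python
import csv
import os
import tempfile

# Made-up sample data; written with a BOM to mimic the problem file.
sample = "First Name|Last Name\nStewart|Dorsey\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", newline="", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # utf-8-sig strips the BOM, so the first header cell comes back clean.
    with open(path, encoding="utf-8-sig", newline="") as csvfile:
        rows = list(csv.reader(csvfile, delimiter="|"))
finally:
    os.remove(path)

print(rows[0])
```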

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Jim Mooney
.. > Ï»¿ > > is the UTF-8 BOM (byte order mark) interpreted as Latin 1. > > If the input is UTF-8 you can get rid of the BOM with > > with open("data.txt", encoding="utf-8-sig") as csvfile: > Peter Otten I caught the bad arithmetic on name length, but where is the byte order mark coming from? My

Re: [Tutor] name shortening in a csv module output

2015-04-23 Thread Peter Otten
Jim Mooney wrote: > I'm trying the csv module. It all went well until I tried shortening a > long first name I put in just to exercise things. It didn't shorten. > Original file lines: > Stewartrewqrhjeiwqhreqwhreowpqhrueqwphruepqhruepqwhruepwhqupr|Dorsey| nec.malesu...@quisqueporttitoreros.co

[Tutor] name shortening in a csv module output

2015-04-23 Thread Jim Mooney
I'm trying the csv module. It all went well until I tried shortening a long first name I put in just to exercise things. It didn't shorten. And I also got weird first characters on the header line. What went wrong? import csv allcsv = [] with open('data.txt') as csvfile: readCSV = csv.reader(c