Where to contribute Unicode General Category encoding/decoding

2012-12-13 Thread Pander Musubi
Hi all,

I have created some handy code to encode and decode Unicode General Categories. 
To which Python Package should I contribute this?

Regards,

Pander
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Where to contribute Unicode General Category encoding/decoding

2012-12-13 Thread Pander Musubi
On Thursday, December 13, 2012 2:22:57 PM UTC+1, Bruno Dupuis wrote:
> On Thu, Dec 13, 2012 at 01:51:00AM -0800, Pander Musubi wrote:
> > Hi all,
> >
> > I have created some handy code to encode and decode Unicode General
> > Categories. To which Python Package should I contribute this?
>
> Hi,
>
> As said in a recent thread (a graph data structure IIRC), talking about
> new features is far better if we see the code, so anyone can figure what
> the code really does.
>
> Can you provide a public repository uri or something?
>
> Standard lib inclusions are not trivial, it most likely happens for
> well-known, mature, PyPI packages, or battle-tested code patterns.
> Therefore, it's often better to make a package on PyPI, or, if the code
> is too short, to submit your handy chunks on ActiveState. If it deserves
> a general approbation, it may be included in Python stdlib.

I was expecting PyPI. Here is the code, please advise on where to submit it:
  http://pastebin.com/dbzeasyq

> Cheers
>
> -- 
> Bruno



Re: Where to contribute Unicode General Category encoding/decoding

2012-12-14 Thread Pander Musubi
On Friday, December 14, 2012 1:06:23 AM UTC+1, Steven D'Aprano wrote:
> On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:
>
> > I was expecting PyPI. Here is the code, please advise on where to submit
> > it:
> >   http://pastebin.com/dbzeasyq
>
> If anywhere, either a third-party module, or the unicodedata standard
> library module.
>
> Some unanswered questions:
>
> - when would somebody need this function?

When working with Unicode metadata, see below.

> - why is it called "decodeUnicodeGeneralCategory" when it
>   doesn't seem to have anything to do with decoding?

It is actually a simple LUT. I like your improvements below.
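
To make the "simple LUT" idea concrete, here is a minimal sketch. It assumes the table maps the two-letter abbreviation returned by the standard `unicodedata.category()` to a long name; the names `GENERAL_CATEGORY_NAMES` and `describe` are hypothetical, and the table holds only a few of the entries quoted below.

```python
import unicodedata

# Hypothetical excerpt of the lookup table; a full table would list
# every General Category abbreviation.
GENERAL_CATEGORY_NAMES = {
    "Lu": "Letter, Uppercase",
    "Ll": "Letter, Lowercase",
    "Nd": "Number, Decimal",
    "Pd": "Punctuation, Dash",
}


def describe(char):
    """Return (abbreviation, long name) for a single character."""
    abbreviation = unicodedata.category(char)
    # Categories missing from the excerpt are reported as "unknown".
    return abbreviation, GENERAL_CATEGORY_NAMES.get(abbreviation, "unknown")


print(describe("A"))  # ('Lu', 'Letter, Uppercase')
print(describe("7"))  # ('Nd', 'Number, Decimal')
```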

> - why is the parameter "sortable" called sortable, when it
>   doesn't seem to have anything to do with sorting?

The values returned are alphabetically sortable.

> If this is useful at all, it would be more useful to just expose the data
> as a dict, and forget about an unnecessary wrapper function:
>
> from collections import namedtuple
> r = namedtuple("record", "other name desc")  # better field names needed!
>
> GC = {
>     'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
>     'Cc': r('Control', 'Control',
>             'a C0 or C1 control code'), # a.k.a. cntrl
>     'Cf': r('Format', 'Format', 'a format control character'),
>     'Cn': r('Unassigned', 'Unassigned',
>             'a reserved unassigned code point or a noncharacter'),
>     'Co': r('Private Use', 'Private_Use', 'a private-use character'),
>     'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
>     'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
>     'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
>     'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
>             'a lowercase letter'),
>     'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
>     'Lo': r('Letter, Other', 'Other_Letter',
>             'other letters, including syllables and ideographs'),
>     'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
>             'a digraphic character, with first part uppercase'),
>     'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
>             'an uppercase letter'),
>     'M' : r('Mark', 'Mark', 'Mc | Me | Mn '), # a.k.a. Combining_Mark
>     'Mc': r('Mark, Spacing', 'Spacing_Mark',
>             'a spacing combining mark (positive advance width)'),
>     'Me': r('Mark, Enclosing', 'Enclosing_Mark',
>             'an enclosing combining mark'),
>     'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
>             'a nonspacing combining mark (zero advance width)'),
>     'N' : r('Number', 'Number', 'Nd | Nl | No'),
>     'Nd': r('Number, Decimal', 'Decimal_Number',
>             'a decimal digit'), # a.k.a. digit
>     'Nl': r('Number, Letter', 'Letter_Number',
>             'a letterlike numeric character'),
>     'No': r('Number, Other', 'Other_Number',
>             'a numeric character of other type'),
>     'P' : r('Punctuation', 'Punctuation',
>             'Pc | Pd | Pe | Pf | Pi | Po | Ps'), # a.k.a. punct
>     'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
>             'a connecting punctuation mark, like a tie'),
>     'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
>             'a dash or hyphen punctuation mark'),
>     'Pe': r('Punctuation, Close', 'C

Custom alphabetical sort

2012-12-24 Thread Pander Musubi
Hi all,

I would like to sort according to this order:

(' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 
'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â', 'à', 'À', 'å', 'Å', 'b', 'B', 'c', 'C', 'ç', 
'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È', 'f', 'F', 'g', 
'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î', 'Î', 'ì', 'Ì', 'j', 'J', 'k', 
'K', 'l', 'L', 'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 'Ö', 'ó', 'Ó', 'ô', 
'Ô', 'ò', 'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 
'U', 'ü', 'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù', 'v', 'V', 'w', 'W', 'x', 'X', 'y', 
'Y', 'z', 'Z')

How can I do this? The default sorted() does not give the desired result.

Thanks,

Pander


Re: Custom alphabetical sort

2012-12-24 Thread Pander Musubi
On Monday, December 24, 2012 5:11:03 PM UTC+1, Thomas Bach wrote:
> On Mon, Dec 24, 2012 at 07:32:56AM -0800, Pander Musubi wrote:
> > I would like to sort according to this order:
> >
> > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
> > 'a', 'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â', 'à', 'À', 'å', 'Å', 'b', 'B', 'c',
> > 'C', 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È',
> > 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î', 'Î', 'ì',
> > 'Ì', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O',
> > 'ö', 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò', 'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q', 'r',
> > 'R', 's', 'S', 't', 'T', 'u', 'U', 'ü', 'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù',
> > 'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
>
> One option is to use sorted's key parameter with an appropriate
> mapping in a dictionary:
>
> >>> cs = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8',
> ...       '9', 'a', 'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â', 'à', 'À', 'å', 'Å', 'b',
> ...       'B', 'c', 'C', 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê',
> ...       'Ê', 'è', 'È', 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í',
> ...       'Í', 'î', 'Î', 'ì', 'Ì', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n',
> ...       'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò', 'Ò', 'ø',
> ...       'Ø', 'p', 'P', 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'ü',
> ...       'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù', 'v', 'V', 'w', 'W', 'x', 'X', 'y',
> ...       'Y', 'z', 'Z')
> >>> d = { k: v for v, k in enumerate(cs) }
> >>> import random
> >>> ''.join(sorted(random.sample(cs, 20), key=d.get))
> '5aAàÀåBCçËÉíÎLÖøquùx'

This doesn't work for words with more than one character:

>>> test=('øasdf', 'áá', 'aa', 'a123','á1234', 'Aaa', )
>>> sorted(test, key=d.get)
['\xc3\xb8asdf', '\xc3\xa1\xc3\xa1', 'aa', 'a123', '\xc3\xa11234', 'Aaa']
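
The words stay in their input order because `d.get` is called with each whole word, and a multi-character word is not a key in `d`, so every word gets the same key, `None`. One possible fix, sketched here for Python 3, is a key function that maps each character of a word to its rank. The names `ALPHABET`, `RANK`, `collation_key` and `sort_words` are made up for this sketch, and it assumes every character of every word appears in the alphabet.

```python
# The requested order, written as one string (same sequence as in the post).
ALPHABET = (
    " .'-0123456789"
    "aAäÄáÁâÂàÀåÅ" "bBcCçÇdD" "eEëËéÉêÊèÈ" "fFgGhH"
    "iIïÏíÍîÎìÌ" "jJkKlLmM" "nñNÑ" "oOöÖóÓôÔòÒøØ"
    "pPqQrRsStT" "uUüÜúÚûÛùÙ" "vVwWxXyYzZ"
)

# Rank of every character in the custom order.
RANK = {char: position for position, char in enumerate(ALPHABET)}


def collation_key(word):
    """Translate a word into a list of ranks; lists compare element-wise."""
    # Raises KeyError for a character that is not in ALPHABET.
    return [RANK[char] for char in word]


def sort_words(words):
    return sorted(words, key=collation_key)


test = ('øasdf', 'áá', 'aa', 'a123', 'á1234', 'Aaa')
print(sort_words(test))
# ['a123', 'aa', 'Aaa', 'á1234', 'áá', 'øasdf']
```

Note that this compares strictly character by character, so 'Aaa' sorts after every word that starts with a lowercase 'a'. Whether that is the desired dictionary order is a separate question.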


> Regards,
>   Thomas.



Re: Custom alphabetical sort

2012-12-24 Thread Pander Musubi
> > Hi all,
> >
> > I would like to sort according to this order:
> >
> > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',
> > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',
> > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',
> > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',
> > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',
> > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',
> > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',
> > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> >
> > How can I do this? The default sorted() does not give the desired result.
>
> I'm assuming that doesn't correspond to some standard locale's collating
> order, so we really do need to roll our own encoding (and that you have
> a good reason for wanting to do this).

It is for creating a Dutch dictionary. This sorting order is not to be found in 
an existing locale.

> I'm also assuming that what I'm
> seeing as question marks are really accented characters in some encoding
> that my news reader just isn't dealing with (it seems to think your post
> was in ISO-2022-CN (Simplified Chinese)).
>
> I'm further assuming that you're starting with a list of unicode
> strings, the contents of which are limited to the above alphabet.

Correct.

> I'm even further assuming that the volume of data you need to sort is
> small enough that efficiency is not a huge concern.

Well, it is for 200,000 - 450,000 words, but the code is allowed to be slow. It
will not be used for a web application or anything else that requires a quick
response.

> Given all that, I would start by writing some code which turned your
> alphabet into a pair of dicts.  One maps from the code point to a
> collating sequence number (i.e. ordinals), the other maps back.
> Something like (for python 2.7):
>
> alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
>             '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
>             [...]
>             'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
>
> map1 = {c: n for n, c in enumerate(alphabet)}
> map2 = {n: c for n, c in enumerate(alphabet)}

OK, similar to Thomas' proposal.

> Next, I would write some functions which encode your strings as lists of
> ordinals (and back again)
>
> def encode(s):
>     "encode('foo') ==> [34, 19, 19]"  # made-up ordinals
>     return [map1[c] for c in s]
>
> def decode(l):
>     "decode([34, 19, 19]) ==> 'foo'"
>     return ''.join(map2[i] for i in l)
>
> Use these to convert your strings to lists of ints which will sort as
> per your specified collating order, and then back again:
>
> encoded_strings = [encode(s) for s in original_list]
> encoded_strings.sort()
> sorted_strings = [decode(l) for l in encoded_strings]
>
> That's just a rough sketch, and completely untested, but it should get
> you headed in the right direction.  Or at least one plausible direction.
> Old-time perl hackers will recognize this as the Schwartzian Transform.

I will test it and let you know. :) Pander
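
For reference, the encode/sort/decode outline quoted above can be put together into something runnable. This is only a sketch for Python 3: it uses just the first part of the alphabet for brevity, and the names `TO_ORDINAL`, `FROM_ORDINAL` and `sort_by_roundtrip` are invented here rather than taken from the thread.

```python
# Prefix of the alphabet from the original post, shortened for brevity.
ALPHABET = " .'-0123456789aAäÄáÁâÂàÀåÅbBcC"

# Character -> collating ordinal, and the reverse mapping.
TO_ORDINAL = {char: n for n, char in enumerate(ALPHABET)}
FROM_ORDINAL = dict(enumerate(ALPHABET))


def encode(word):
    """Turn a word into a list of ordinals (KeyError on unknown characters)."""
    return [TO_ORDINAL[char] for char in word]


def decode(ordinals):
    """Turn a list of ordinals back into the original word."""
    return "".join(FROM_ORDINAL[n] for n in ordinals)


def sort_by_roundtrip(words):
    encoded = [encode(word) for word in words]
    encoded.sort()  # lists of ints compare element by element
    return [decode(ordinals) for ordinals in encoded]


print(sort_by_roundtrip(["ba", "áb", "ab", "Ab", "a1"]))
# ['a1', 'ab', 'Ab', 'áb', 'ba']
```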


Re: Custom alphabetical sort

2012-12-24 Thread Pander Musubi
> > > I'm assuming that doesn't correspond to some standard locale's collating
> > > order, so we really do need to roll our own encoding (and that you have
> > > a good reason for wanting to do this).
> >
> > It is for creating a Dutch dictionary.
>
> Wait a minute.  You're telling me that Python, of all languages, doesn't
> have a built-in way to sort Dutch words???

Not when you want Roman characters with diacritics to be sorted in the normal 
a-Z range.


Re: Custom alphabetical sort

2012-12-24 Thread Pander Musubi
On Monday, December 24, 2012 7:12:43 PM UTC+1, Joshua Landau wrote:
> On 24 December 2012 16:18, Roy Smith  wrote:
>
> In article <[email protected]>,
>  Pander Musubi  wrote:
>
> > Hi all,
> >
> > I would like to sort according to this order:
> >
> > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',
> > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',
> > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',
> > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',
> > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',
> > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',
> > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',
> > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> >
> > How can I do this? The default sorted() does not give the desired result.
>
> Given all that, I would start by writing some code which turned your
> alphabet into a pair of dicts.  One maps from the code point to a
> collating sequence number (i.e. ordinals), the other maps back.
> Something like (for python 2.7):
>
> alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
>             '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
>             [...]
>             'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
>
> map1 = {c: n for n, c in enumerate(alphabet)}
> map2 = {n: c for n, c in enumerate(alphabet)}
>
> Next, I would write some functions which encode your strings as lists of
> ordinals (and back again)
>
> def encode(s):
>     "encode('foo') ==> [34, 19, 19]"  # made-up ordinals
>     return [map1[c] for c in s]
>
> def decode(l):
>     "decode([34, 19, 19]) ==> 'foo'"
>     return ''.join(map2[i] for i in l)
>
> Use these to convert your strings to lists of ints which will sort as
> per your specified collating order, and then back again:
>
> encoded_strings = [encode(s) for s in original_list]
> encoded_strings.sort()
> sorted_strings = [decode(l) for l in encoded_strings]
>
> 
> This isn't needed and the not-so-new way to do this is through .sort's key
> argument.
>
> encoded_strings = [encode(s) for s in original_list]
> encoded_strings.sort()
> sorted_strings = [decode(l) for l in encoded_strings]
>
> changes to
>
> original_list.sort(key=encode)
>
> [Which happens to be faster ]
>
> Hence you need neither map2 nor decode:
>
> ## CODE ##
>
> alphabet = (
>   ' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
>

Re: Custom alphabetical sort

2015-05-02 Thread Pander Musubi
On Monday, 24 December 2012 16:32:56 UTC+1, Pander Musubi  wrote:
> Hi all,
> 
> I would like to sort according to this order:
> 
> (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 
> 'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â', 'à', 'À', 'å', 'Å', 'b', 'B', 'c', 'C', 
> 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È', 'f', 
> 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î', 'Î', 'ì', 'Ì', 
> 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 
> 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò', 'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q', 'r', 'R', 
> 's', 'S', 't', 'T', 'u', 'U', 'ü', 'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù', 'v', 
> 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> 
> How can I do this? The default sorted() does not give the desired result.
> 
> Thanks,
> 
> Pander

Meanwhile, Python 3 supports locale-aware sorting; see
https://docs.python.org/3/howto/sorting.html
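
A minimal sketch of what that looks like, using `locale.strxfrm` as the sort key. The helper name `locale_sorted` is made up for this example, and the locale name "nl_NL.UTF-8" is an assumption: it only works where that locale is installed, and the resulting order is whatever the system's collation tables define, which may differ from the custom order requested above.

```python
import locale


def locale_sorted(words, locale_name=""):
    """Sort words using the collation rules of the given locale.

    locale_name="" selects the user's default locale.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore it


words = ["zebra", "appel", "één", "eend"]
try:
    print(locale_sorted(words, "nl_NL.UTF-8"))
except locale.Error:
    # The Dutch locale is not installed here; fall back to the "C" locale,
    # which compares by code point.
    print(locale_sorted(words, "C"))
```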