[Tutor] 1 to N searches in files

2012-12-02 Thread Spectral None
Hi all

I have two files (File A and File B) with strings of data in them (each string 
on a separate line). Basically, each string in File B will be compared with all 
the strings in File A and the resulting output is to show a list of 
matched/unmatched lines and optionally to write to a third File C

File A: Unique strings
File B: Can have duplicate strings (that is, "string1" may appear more than 
once)

My code currently looks like this:

-
FirstFile = open('C:\FileA.txt', 'r')
SecondFile = open('C:\FileB.txt', 'r')
ThirdFile = open('C:\FileC.txt', 'w')

a = FirstFile.readlines()
b = SecondFile.readlines()

mydiff = difflib.Differ()
results = mydiff(a,b)
print("\n".join(results))

#ThirdFile.writelines(results)

FirstFile.close()
SecondFile.close()
ThirdFile.close()
-

However, it seems that the results do not correctly reflect the 
matched/unmatched lines. As an example, if FileA contains "string1" and FileB 
contains multiple occurrences of "string1", it seems that the first occurrence 
matches correctly but subsequent "string1"s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not 
doing what I hoped to do? Is Differ() comparing first line to first line and 
second line to second line etc in contrast to what I wanted to do?

Regards
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] 1 to N searches in files

2012-12-02 Thread Steven D'Aprano

On 02/12/12 19:53, Spectral None wrote:


However, it seems that the results do not correctly reflect the
matched/unmatched lines. As an example, if FileA contains "string1"
and FileB contains multiple occurrences of "string1", it seems that
 the first occurrence matches correctly but subsequent "string1"s
are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that
 it is not doing what I hoped to do? Is Differ() comparing first line
 to first line and second line to second line etc in contrast to what
 I wanted to do?


No, and yes.

No, it is not comparing first line to first line.

And yes, it is acting in contrast to what you hope to do, otherwise you
wouldn't be asking the question :-)

Unfortunately, you don't explain what it is that you hope to do, so I'm
going to have to guess. See below.

difflib is used for finding differences between two files. It will try to
find a set of changes which will turn file A into file B, e.g:

insert this line here
delete this line there
...


and repeated as many times as needed. Except that difflib.Differ uses
a shorthand of "+" and "-" to indicate adding and deleting lines.

You can find out more about difflib and Differ objects by reading the
Fine Manual. Open a Python interactive shell, and do this:

import difflib
help(difflib.Differ)


If you have any questions, please feel free to ask.

In the code sample you give, you say you do this:

mydiff = difflib.Differ()
results = mydiff(a,b)

but that doesn't work: Differ objects are not callable. Please do not
paraphrase your code. Copy and paste the exact code you have actually
run, don't try to type it out from memory.
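
For reference, the method that a Differ object does provide is compare(). The following is only a minimal sketch, assuming Python 3; the two short lists are made-up stand-ins for the lines read from File A and File B, not the poster's real data:

```python
import difflib

# Hypothetical stand-ins for the lines read from File A and File B.
a = ["string1\n", "string2\n"]
b = ["string1\n", "string1\n", "string2\n"]

differ = difflib.Differ()
# compare() yields one output line per input line, prefixed with
# "  " (present in both), "- " (only in a) or "+ " (only in b).
results = list(differ.compare(a, b))
print("".join(results), end="")
```

The repeated "string1" is reported with a "+" prefix because difflib describes the edits that turn a into b. It does not test whether each line of b occurs somewhere in a, which is why later duplicates look "unmatched".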

Now, I *guess* that what you are trying to do is something like this...
given files A and B:


# file A
spam
ham
eggs
tomato


# file B
tomato
spam
eggs
cheese
spam
spam


you want to generate three lists:

# lines in B that were also in A:
tomato
spam
eggs


# lines in B that were not in A:
cheese


# lines in A that were not found in B:
ham


Am I close?

If not, please explain with an example what you are trying
to do.
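
If that guess is right, the three lists can be built with sets rather than difflib. This is a sketch only, assuming Python 3; the helper name unique_in_order is made up for illustration, and the data is the spam/ham example above held in lists rather than read from real files:

```python
# The example data from above, as lists of already-stripped lines.
file_a = ["spam", "ham", "eggs", "tomato"]
file_b = ["tomato", "spam", "eggs", "cheese", "spam", "spam"]


def unique_in_order(lines):
    """Drop repeated lines, keeping the first occurrence of each (hypothetical helper)."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


set_a = set(file_a)
set_b = set(file_b)
b_unique = unique_in_order(file_b)

in_both = [line for line in b_unique if line in set_a]        # lines in B also in A
only_in_b = [line for line in b_unique if line not in set_a]  # lines in B not in A
only_in_a = [line for line in file_a if line not in set_b]    # lines in A not found in B

print(in_both, only_in_b, only_in_a)
```

Set membership tests do not depend on position, so the repeated "spam" lines in file B all count as matches.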


--
Steven


Re: [Tutor] 1 to N searches in files

2012-12-02 Thread Dave Angel
On 12/02/2012 03:53 AM, Spectral None wrote:
> Hi all
>
> I have two files (File A and File B) with strings of data in them (each 
> string on a separate line). Basically, each string in File B will be compared 
> with all the strings in File A and the resulting output is to show a list of 
> matched/unmatched lines and optionally to write to a third File C
>
> File A: Unique strings
> File B: Can have duplicate strings (that is, "string1" may appear more than 
> once)
>
> My code currently looks like this:
>
> -
> FirstFile = open('C:\FileA.txt', 'r')
> SecondFile = open('C:\FileB.txt', 'r')
> ThirdFile = open('C:\FileC.txt', 'w')
>
> a = FirstFile.readlines()
> b = SecondFile.readlines()
>
> mydiff = difflib.Differ()
> results = mydiff(a,b)
> print("\n".join(results))
>
> #ThirdFile.writelines(results)
>
> FirstFile.close()
> SecondFile.close()
> ThirdFile.close()
> -
>
> However, it seems that the results do not correctly reflect the 
> matched/unmatched lines. As an example, if FileA contains "string1" and FileB 
> contains multiple occurrences of "string1", it seems that the first 
> occurrence matches correctly but subsequent "string1"s are treated as 
> unmatched strings.
>
> I am thinking perhaps I don't understand Differ() that well and that it is 
> not doing what I hoped to do? Is Differ() comparing first line to first line 
> and second line to second line etc in contrast to what I wanted to do?
>
> Regards
>
>
Let me guess your goal, and then, on that assumption, discuss your code.


I think your File A is supposed to be a dictionary of valid words
(strings).  You want to process File B, checking each line against that
dictionary, and make a list of which lines are "valid" (in the
dictionary), and another of which lines are not (missing from the
dictionary).   That's one list for matched lines, and one for unmatched.

That isn't even close to what difflib does.  This can be solved with
minimal code, but not by starting with difflib.

What you should do is to loop through File A, adding all the lines to a
set called valid_dictionary.  Calling set(FirstFile) can do that in one
line, without even calling readlines().
Then a simple loop can build the desired lists.  The matched_lines is
simply all lines which are in the dictionary, while unmatched_lines are
those which are not.

The heart of the comparison could simply look like:

    if line in valid_dictionary:
        matched_lines.append(line)
    else:
        unmatched_lines.append(line)
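
Put together, the whole approach might look like the sketch below. It assumes Python 3. The function name split_matches is made up for illustration, io.StringIO objects stand in for the real files, and stripping the trailing newline is an added assumption so that a last line without a newline still matches:

```python
import io


def split_matches(valid_lines, candidate_lines):
    """Split candidate lines into those found in valid_lines and those not found."""
    # Build the lookup set once; membership tests are then fast.
    valid_dictionary = {line.rstrip("\n") for line in valid_lines}
    matched_lines = []
    unmatched_lines = []
    for line in candidate_lines:
        line = line.rstrip("\n")
        if line in valid_dictionary:
            matched_lines.append(line)
        else:
            unmatched_lines.append(line)
    return matched_lines, unmatched_lines


# In-memory stand-ins for File A and File B (hypothetical contents).
file_a = io.StringIO("string1\nstring2\n")
file_b = io.StringIO("string1\nstring3\nstring1\n")

matched, unmatched = split_matches(file_a, file_b)
print(matched)
print(unmatched)
```

Every occurrence of "string1" in File B lands in the matched list, duplicates included, because each line is checked against the set independently.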


-- 

DaveA



Re: [Tutor] how to struct.pack a unicode string?

2012-12-02 Thread Albert-Jan Roskam
>> How can I pack a unicode string using the struct module? If I simply use

>> packed = struct.pack(fmt, hello) in the code below (and 'hello' is a
>> unicode string), I get this: "error: argument for 's' must be a string". I
>> keep reading that I have to encode it to a utf-8 bytestring, but this does
>> not work (it yields mojibake and tofu output for some of the languages).
>
>You keep reading it because it is the right approach. You will not get 
>mojibake if you decode the "packed" data before using it. 
>
>Your code basically becomes
>
>for greet in greetings:
>    language, chars, encoding = greet
>    hello = "".join([unichr(i) for i in chars])
>    packed = hello.encode("utf-8")
>    unpacked = packed.decode("utf-8")
>    print unpacked
>
>I don't know why you mess with byte order, perhaps you can tell a bit about 
>your actual use-case.


Hi Peter,

Thanks for helping me. I am writing binary files and I wanted to create test 
data for this.
--this has been a good test case, such that (a) it demonstrated a defect in my 
program (b) idem, my knowledge. I realize how cp1252-ish I am; for instance, I 
wrongly tend to assume that len(someUnicodeString) == 
nbytes_of_that_unicode_string.

--re: messing with byte order: I read in M. Summerfield's "Programming in 
Python 3" that it's advisable to always specify the byte order, for portability 
of the data. But, now that you mention it, the way I did it, I might as well 
omit it. Or, given that the binary format I am writing contains information 
about the byte order, I might hard-code the byte order (e.g. always write LE). 
That would follow Mark Summerfield's advice, if I understand it correctly.
--(Aside from your advice to use utf-8) Given that sys.maxunicode == 65535 on 
my system (ie, that many unicode points can be represented in my compilation of 
Python) I'd expect that I not only could write 
u'blaah'.encode("unicode-internal"), but also u'blaah'.encode("ucs-2")
Traceback (most recent call last):
  File "", line 1, in 
    u'blaah'.encode("ucs-2")
LookupError: unknown encoding: ucs-2
Why is the label "unicode-internal" to indicate both ucs-2 and ucs-4? And why 
does the same Python version on my Linux computer use 1114111 code points? Can 
we conclude that Linux users are better equipped to write a letter in Burmese or 
Aleut? ;-)

Thanks again!

Regards,
Albert-Jan



Re: [Tutor] how to struct.pack a unicode string?

2012-12-02 Thread Albert-Jan Roskam


> 
> * some encodings are more compact than others (e.g. Latin-1 uses
>   one byte per character, while UTF-32 uses four bytes per
>   character).

I read that performance of UTF32 is better ("UTF-32 advantage: you don't need 
to decode stored data to the 32-bit Unicode code point for e.g. character by 
character handling. The code point is already available right there in 
your array/vector/string.").
http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32
But given that utf-32 is a memory hog, should one conclude that it's usually 
not a good idea to use it (esp. in Python)?
 
>>  but this does not work (it yields mojibake and tofu output for
>>  some of the languages).
> 
> It would be useful to see an example of this.
> 
> But if you do your encoding/decoding correctly, using the right
> codecs, you should never get mojibake. You only get that when
> you have a mismatch between the encoding you think you have and
> the encoding you actually have.
> 
> 
>>  It's annoying if one needs to know the encoding in which each
>>  individual language should be represented. I was hoping
>>  "unicode-internal" was the way to do it, but this does not
>>  reproduce the original string when I unpack it.. :-(
> 
> Yes, encodings are annoying. The sooner that all encodings other
> than UTF-8 and UTF-32 disappear the better :)

So true ;-)

> The beauty of using UTF-8 instead of one of the many legacy
> encodings is that UTF-8 can represent any character, so you don't
> need to care about the individual language, and it is compact (at
> least for Western European languages).

Later you write "You need a variable-length struct, of course.". Is this 
because ASCII is a subset of UTF-8?
The thing is, the binary format I am writing (spss .sav) uses *fixed* 
column widths. This means that, even when I only use the ascii subset of 
utf-8, I still need to assume the worst-case scenario, namely 3 bytes per 
symbol, right?
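
A fixed-width field of this kind can be sketched as follows, assuming Python 3. The helper names pack_fixed and unpack_fixed are made up for illustration. The NUL padding is simply what struct's "s" format does by default; the padding the .sav format actually expects is not specified in this thread. Note also that 3 bytes per character only covers the Basic Multilingual Plane; characters outside it take 4 bytes in UTF-8:

```python
import struct


def pack_fixed(text, width):
    """Encode text as UTF-8 and pack it into a field of exactly `width` bytes."""
    data = text.encode("utf-8")
    if len(data) > width:
        # Refuse rather than truncate: cutting inside a multi-byte
        # character would leave bytes that cannot be decoded.
        raise ValueError("encoded text needs %d bytes, field has %d" % (len(data), width))
    # The 's' format pads values shorter than the field with NUL bytes.
    return struct.pack("<%ds" % width, data)


def unpack_fixed(packed, width):
    """Reverse pack_fixed: strip the NUL padding and decode the UTF-8 bytes."""
    (data,) = struct.unpack("<%ds" % width, packed)
    return data.rstrip(b"\x00").decode("utf-8")


greeting = "Gr\xfcezi"               # 6 characters, 7 bytes in UTF-8
width = 3 * len(greeting)            # worst case for BMP characters
packed = pack_fixed(greeting, width)
print(len(greeting), len(greeting.encode("utf-8")), len(packed))
print(unpack_fixed(packed, width) == greeting)
```

The field width is counted in bytes of the encoded string, not in characters, which is the same point made about len() earlier in the thread.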
 
> Why are you using struct for this? If you want to convert Unicode
> strings into a sequence of bytes, that's exactly what the encode
> method does. There's no need for struct.
 
I am using struct to read/write binary data. I created the 'greetings' code to 
test my program (and my knowledge).
As I said to Peter Otten, both were/are imperfect ;-). Struct needs a 
bytestring, not a unicode string, hence I needed to convert
my unicode strings first. I used these languages because I suspected I often 
get away with errors because 'my' encoding
(cp1252) is fairly easy.
 
> greetings = [
>         ('Arabic', 
> u'\u0627\u0644\u0633\u0644\u0627\u0645\u0020\u0639\u0644\u064a\u0643\u0645', 
> 'cp1256'),
>         ('Assamese', 
> u'\u09a8\u09ae\u09b8\u09cd\u0995\u09be\u09f0', 
> 'utf-8'),
>         ('Bengali', 
> u'\u0986\u09b8\u09b8\u09be\u09b2\u09be\u09ae\u09c1 
> \u0986\u09b2\u09be\u0987\u0995\u09c1\u09ae', 
> 'utf-8'),
>         ('English', u'Greetings and salutations', 
> 'ascii'),
>         ('Georgian', 
> u'\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0', 
> 'utf-8'),
>         ('Kazakh', 
> u'\u0421\u04d9\u043b\u0435\u043c\u0435\u0442\u0441\u0456\u0437 
> \u0431\u0435', 'utf-8'),
>         ('Russian', 
> u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435', 
> 'utf-8'),
>         ('Spanish', u'\xa1Hola!', 'cp1252'),
>         ('Swiss German', u'Gr\xfcezi', 'cp1252'),
>         ('Thai', 
> u'\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35', 
> 'cp874'),
>         ('Walloon', u'Bondjo\xfb', 'cp1252'),
>         ]
> for language, greet, encoding in greetings:
>     print u"Hello in %s: %s" % (language, greet)
>     for enc in ('utf-8', 'utf-16', 'utf-32', encoding):
>         bytestring = greet.encode(enc)
>         print "encoded as %s gives %r" % (enc, bytestring)
>         if bytestring.decode(enc) != greet:
>             print "*** round-trip encoding/decoding failed ***"
> 
> 
> Any of the byte strings can then be written directly to a file:
> 
> f.write(bytestring)
> 
> or embedded into a struct. You need a variable-length struct, of course.
 
See above. I believe I've got it working for character data already; now I 
still need to check whether I can also store 
e.g. Chinese metadata in my spss file.

> My advice: stick to Python unicode strings internally, and always write
> them to files as UTF-8.


Thanks Steven, I appreciate it! 


Re: [Tutor] how to struct.pack a unicode string?

2012-12-02 Thread Albert-Jan Roskam


 




> to make is that the transform formats are multibyte encodings (except
> ASCII in UTF-8), which means the expression str(len(hello)) is using
> the wrong length; it needs to use the length of the encoded string.
> Also, UTF-16 and UTF-32 typically have very many null bytes. Together,
> these two observations explain the error: "unicode_internal' codec
> can't decode byte 0x00 in position 12: truncated input".

Hi Eryksun,

Observation #1: Yes, makes perfect sense. I should have thought about that. 
Observation #2:
As I emailed earlier today to Peter Otten, I thought unicode_internal means 
UCS-2 or UCS-4,
depending on the size of sys.maxunicode? How is this related to UTF-16 and 
UTF-32?

Thank you!

Best regards,
Albert-Jan



Re: [Tutor] how to struct.pack a unicode string?

2012-12-02 Thread Dave Angel
On 12/02/2012 08:34 AM, Albert-Jan Roskam wrote:
>
>  
>
> 
>
>
> Hi Eryksun,
>
> Observation #1: Yes, makes perfect sense. I should have thought about that. 
> Observation #2:
> As I emailed earlier today to Peter Otten, I thought unicode_internal means 
> UCS-2 or UCS-4,
> depending on the size of sys.maxunicode? How is this related to UTF-16 and 
> UTF-32?

How is maxunicode relevant?  Are you stuck on 3.2 or something?  Python
3.3 uses 1 byte, 2 bytes or 4 for internal storage of a string depending
only upon the needs of that particular string.



-- 

DaveA



Re: [Tutor] FW: (no subject)

2012-12-02 Thread Ashfaq
Luke,

Thanks. The generator syntax is really cool.

--
Ashfaq


[Tutor] Help with writing a program

2012-12-02 Thread rajesh mullings
Hello, I am trying to write a program which takes two lines of input, one
called "a", and one called "b", which are both strings, then outputs the
number of times a is a substring of b. If you could give me an
algorithm/pseudo code of what I should do to create this program, I would
greatly appreciate that. Thank you for using your time to consider my
request.


Re: [Tutor] Help with writing a program

2012-12-02 Thread Mark Lawrence

On 03/12/2012 03:59, rajesh mullings wrote:

Hello, I am trying to write a program which takes two lines of input, one
called "a", and one called "b", which are both strings, then outputs the
number of times a is a substring of b. If you could give me an
algorithm/pseudo code of what I should do to create this program, I would
greatly appreciate that. Thank you for using your time to consider my
request.



Start here http://docs.python.org/2/library/string.html

--
Cheers.

Mark Lawrence.



Re: [Tutor] reverse diagonal

2012-12-02 Thread eryksun
On Sun, Dec 2, 2012 at 2:32 AM, Steven D'Aprano  wrote:
>
>> ~i returns the value (-i - 1):
>
> Assuming certain implementation details about how integers are stored,
> namely that they are two's complement rather than ones' complement or
> something more exotic.

Yes, the result is platform dependent, at least for the 2.x int type.
I saw it in someone else's code or blog a while ago and thought I'd
pass it along as a novelty and something to keep an eye out for.

A multiprecision long might qualify as exotic. It uses sign-magnitude
form. The sign of the number and the length of ob_digit are both
stored in ob_size. For the invert op, it adds 1 and negates the sign
to emulate 2's complement:

http://hg.python.org/cpython/file/70274d53c1dd/Objects/longobject.c#l3566

Further along is more 2's complement emulation for bitwise &, |, and ^:

http://hg.python.org/cpython/file/70274d53c1dd/Objects/longobject.c#l3743

> Okay, just about every computer made since 1960 uses two's complement
> integers, but still, the effect of ~i depends on the way integers are
> represented internally rather than some property of integers as an
> abstract number. That makes it a code smell.

It relies on integer modulo arithmetic. The internal base is arbitrary
and not apparent. It could be 10s complement on some hypothetical base
10 computer. In terms of a Python sequence, you could use unsigned
indices such as [0,1,2,3,4,5,6,7] or the N=8 complement indices
[0,1,2,3,-4,-3,-2,-1], where -1 % 8 == 7, and so on. The invert op can
be generalized as N-1-i for any N-length window on the integers (e.g.
5-digit base 10, where N=10**5, subtract i from N-1 == 9), which
just inverts the sequence order. The interpretation of this as a
negative number depends on a signed type that represents negative
values as modulo N. That's common because it's a simple shift of the
window to be symmetric about 0 (well, almost symmetric for even N);
the modulo arithmetic is easy and there's no negative 0. However, with
a multiprecision integer type, it's simpler to use a sign magnitude
representation.
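
The index arithmetic above can be checked directly. This is a small sketch, assuming Python 3, with a made-up 8-item sequence and a 3x3 matrix chosen only for illustration:

```python
seq = list("abcdefgh")
n = len(seq)

for i in range(n):
    # Python defines ~i as -i - 1 for its integers.
    assert ~i == -i - 1
    # As an index, ~i picks the mirror-image position N-1-i.
    assert seq[~i] == seq[n - 1 - i]
    # Negative indices behave like the "N complement" window: -1 % 8 == 7.
    assert ~i % n == n - 1 - i

# Reverse (anti-)diagonal of a square matrix using ~i as the column index.
matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
reverse_diagonal = [row[~i] for i, row in enumerate(matrix)]
print(reverse_diagonal)
```

Writing row[len(row) - 1 - i] gives the same result and states the intent more plainly, which is the readability trade-off discussed here.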

That said, I don't want to give the impression that I disagree with
you. You're right that it isn't generally advisable to use a single
operation instead of two or three if it sacrifices clarity and
portability. It didn't jump out at me as a problem since I take 2s
complement for granted and have a bias to favor symmetry and
minimalism.


Re: [Tutor] Help with writing a program

2012-12-02 Thread fantasticrm
The Python version is Python 3.

On Sun, Dec 2, 2012 at 10:59 PM, rajesh mullings wrote:

> Hello, I am trying to write a program which takes two lines of input, one
> called "a", and one called "b", which are both strings, then outputs the
> number of times a is a substring of b. If you could give me an
> algorithm/pseudo code of what I should do to create this program, I would
> greatly appreciate that. Thank you for using your time to consider my
> request.


Re: [Tutor] how to struct.pack a unicode string?

2012-12-02 Thread eryksun
On Sun, Dec 2, 2012 at 8:34 AM, Albert-Jan Roskam  wrote:
>
> As I emailed earlier today to Peter Otten, I thought unicode_internal means
> UCS-2 or UCS-4, depending on the size of sys.maxunicode? How is this related
> to UTF-16 and UTF-32?

UCS is the universal character set. Some highlights of the Basic
Multilingual Plane (BMP): U+0000-U+00FF is Latin-1 (including the C0
and C1 control codes). U+D800-U+DFFF is reserved for UTF-16 surrogate
pairs. U+E000-U+F8FF is reserved for private use. Most of
U+F900-U+FFFF is assigned. Notably U+FEFF (zero width no-break space)
doubles as the BOM/signature in the transformation formats.

UTF-16 encodes the supplementary planes by using 2 codes as a
surrogate pair. This uses a reserved 11-bit block (U+D800-U+DFFF),
which is split into two 10-bit ranges: U+D800-U+DBFF for the lead
surrogate and U+DC00-U+DFFF for the trail surrogate. Together that's
the required 20 bits for the 16 supplementary planes. Including the
BMP, this scheme covers the complete UCS range of 17 * 2**16 ==
1114112 codes (on a wide build, that's sys.maxunicode + 1).
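
The surrogate arithmetic can be sketched as follows, assuming Python 3.3 or later. The function name to_surrogate_pair and the sample code point U+1F600 are chosen only for illustration; the result is compared against Python's own "utf-16-be" codec:

```python
import struct
import sys


def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # a 20-bit value
    lead = 0xD800 + (offset >> 10)       # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return lead, trail


code_point = 0x1F600                     # sample code point outside the BMP
lead, trail = to_surrogate_pair(code_point)
encoded = chr(code_point).encode("utf-16-be")

print(hex(lead), hex(trail))
print(encoded == struct.pack(">HH", lead, trail))
print(17 * 2**16 == sys.maxunicode + 1)
```

The lead surrogate carries the upper 10 bits of the 20-bit offset and the trail surrogate the lower 10, which is why the two ranges together cover exactly the 16 supplementary planes.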

For encoding text, use one of the transformation formats such as
UTF-8, UTF-16, or UTF-32. Unless you have a requirement to use UTF-16
or UTF-32, it's best to stick to encoding to UTF-8. It's the default
encoding in 3.x. It's also generally the most compact representation
(especially if there's a lot of ASCII) and compatible with
null-terminated byte strings (i.e. C array of char, terminated by
NUL). Regardless of narrow vs wide build, you can always encode to one
of these formats. The encoders for UTF-8 and UTF-32 first recombine
any surrogate pairs in the internal representation.

CPython 3.3 has a new implementation that angles for the best of all
worlds, opting for a 1-byte, 2-byte, or 4-byte representation
depending on the maximum code in the string. The internal
representation doesn't use surrogates, so there's no more narrow vs
wide build distinction.
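
One rough way to observe this is to measure how much a string grows per added character. The sketch below assumes CPython 3.3 or later; sys.getsizeof reports implementation details, so other interpreters may give different numbers. The helper name bytes_per_char is made up for illustration:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of `ch` by n characters."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


latin1 = bytes_per_char("a")            # highest code point below U+0100
bmp = bytes_per_char("\u0416")          # highest code point inside the BMP
astral = bytes_per_char("\U0001F600")   # code point beyond the BMP

print(latin1, bmp, astral)
```

Taking the difference between two sizes cancels out the fixed object header, so only the per-character cost remains.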


Re: [Tutor] FW: (no subject)

2012-12-02 Thread Luke Paireepinart
On Sun, Dec 2, 2012 at 8:41 PM, Ashfaq  wrote:

> Luke,
>
> Thanks. The generator syntax is really cool.
>

I misspoke, the correct term is "list comprehension".  A generator is
something totally different!  Sorry about the confusion, my fault.  I type
too fast sometimes :)
Glad you liked it though.
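
The code under discussion is not included in this message, so the following is only a generic illustration of the two terms, assuming Python 3 and a made-up list of numbers:

```python
numbers = [1, 2, 3, 4]

# List comprehension: square brackets, builds the whole list immediately.
squares_list = [n * n for n in numbers]

# Generator expression: parentheses, produces values lazily, one at a time.
squares_gen = (n * n for n in numbers)

first = next(squares_gen)     # consumes the first value
rest = list(squares_gen)      # consumes whatever is left
leftover = list(squares_gen)  # the generator is now exhausted

print(squares_list, first, rest, leftover)
```

A list can be indexed and iterated repeatedly, while a generator can only be consumed once.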

-Luke


Re: [Tutor] Help with writing a program

2012-12-02 Thread Luke Paireepinart
There is an equivalent page in the documentation for Python 3 as well,
regarding strings.

This sounds a lot like a homework problem so you are unlikely to get a lot
of help.  You certainly won't get exact code.

What have you tried so far?  Where are you getting stuck?  We're not here
to write code for you, this list is meant to help you learn something
yourself.  If you just want someone to write code for you there are plenty
of sites that will do that.  But if you want to figure it out I'd be happy
to give you some hints if I can see that you're making some effort.  One
effort you could make would be to find the relevant Python 3 document
discussing strings and check if it has some references to finding
substrings.

Let me know what you try and I'll help you if you get stuck.

Thanks,
-Luke


On Sun, Dec 2, 2012 at 11:31 PM, fantasticrm  wrote:

> The Python version is Python 3.
>
>
> On Sun, Dec 2, 2012 at 10:59 PM, rajesh mullings wrote:
>
>> Hello, I am trying to write a program which takes two lines of input, one
>> called "a", and one called "b", which are both strings, then outputs the
>> number of times a is a substring of b. If you could give me an
>> algorithm/pseudo code of what I should do to create this program, I would
>> greatly appreciate that. Thank you for using your time to consider my
>> request.