[Tutor] 1 to N searches in files
Hi all

I have two files (File A and File B) with strings of data in them (each string on a separate line). Basically, each string in File B will be compared with all the strings in File A and the resulting output is to show a list of matched/unmatched lines and optionally to write to a third File C

File A: Unique strings
File B: Can have duplicate strings (that is, "string1" may appear more than once)

My code currently looks like this:

    FirstFile = open('C:\FileA.txt', 'r')
    SecondFile = open('C:\FileB.txt', 'r')
    ThirdFile = open('C:\FileC.txt', 'w')

    a = FirstFile.readlines()
    b = SecondFile.readlines()

    mydiff = difflib.Differ()
    results = mydiff(a,b)
    print("\n".join(results))

    #ThirdFile.writelines(results)

    FirstFile.close()
    SecondFile.close()
    ThirdFile.close()

However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains "string1" and FileB contains multiple occurrences of "string1", it seems that the first occurrence matches correctly but subsequent "string1"s are treated as unmatched strings.

I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

Regards

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] 1 to N searches in files
On 02/12/12 19:53, Spectral None wrote:

> However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains "string1" and FileB contains multiple occurrences of "string1", it seems that the first occurrence matches correctly but subsequent "string1"s are treated as unmatched strings.
>
> I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

No, and yes. No, it is not comparing first line to first line. And yes, it is acting in contrast to what you hope to do, otherwise you wouldn't be asking the question :-)

Unfortunately, you don't explain what it is that you hope to do, so I'm going to have to guess. See below.

difflib is used for finding differences between two files. It will try to find a set of changes which will turn file A into file B, e.g:

    insert this line here
    delete this line there
    ...

and repeated as many times as needed. Except that difflib.Differ uses a shorthand of "+" and "-" to indicate adding and deleting lines.

You can find out more about difflib and Differ objects by reading the Fine Manual. Open a Python interactive shell, and do this:

    import difflib
    help(difflib.Differ)

If you have any questions, please feel free to ask.

In the code sample you give, you say you do this:

    mydiff = difflib.Differ()
    results = mydiff(a,b)

but that doesn't work, Differ objects are not callable (the comparison is done by the compare() method). Please do not paraphrase your code. Copy and paste the exact code you have actually run, don't try to type it out from memory.

Now, I *guess* that what you are trying to do is something like this... given files A and B:

    # file A
    spam
    ham
    eggs
    tomato

    # file B
    tomato
    spam
    eggs
    cheese
    spam
    spam

you want to generate three lists:

    # lines in B that were also in A:
    tomato
    spam
    eggs

    # lines in B that were not in A:
    cheese

    # lines in A that were not found in B:
    ham

Am I close? If not, please explain with an example what you are trying to do.

--
Steven
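A minimal sketch of the three lists guessed at above, assuming each file has already been read into a list of lines with the trailing newlines stripped. The names `unique_in_order`, `in_both`, `only_in_b` and `only_in_a` are illustrative and not from the thread; repeated lines in B are reported once, in order of first appearance, to match the example.

```python
# Illustrative data copied from the example above.
file_a = ["spam", "ham", "eggs", "tomato"]
file_b = ["tomato", "spam", "eggs", "cheese", "spam", "spam"]

# Sets give fast membership tests.
set_a = set(file_a)
set_b = set(file_b)


def unique_in_order(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


in_both = [line for line in unique_in_order(file_b) if line in set_a]
only_in_b = [line for line in unique_in_order(file_b) if line not in set_a]
only_in_a = [line for line in file_a if line not in set_b]

print(in_both)    # ['tomato', 'spam', 'eggs']
print(only_in_b)  # ['cheese']
print(only_in_a)  # ['ham']
```

Whether duplicates should be collapsed like this is an assumption; the original poster never said.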
Re: [Tutor] 1 to N searches in files
On 12/02/2012 03:53 AM, Spectral None wrote:
> I have two files (File A and File B) with strings of data in them (each string on a separate line). Basically, each string in File B will be compared with all the strings in File A and the resulting output is to show a list of matched/unmatched lines and optionally to write to a third File C
>
> File A: Unique strings
> File B: Can have duplicate strings (that is, "string1" may appear more than once)
>
> [...]
>
> However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains "string1" and FileB contains multiple occurrences of "string1", it seems that the first occurrence matches correctly but subsequent "string1"s are treated as unmatched strings.
>
> I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?

Let me guess your goal, and then, on that assumption, discuss your code.

I think your File A is supposed to be a dictionary of valid words (strings). You want to process File B, checking each line against that dictionary, and make a list of which lines are "valid" (in the dictionary), and another of which lines are not (missing from the dictionary). That's one list for matched lines, and one for unmatched.

That isn't even close to what difflib does. This can be solved with minimal code, but not by starting with difflib.

What you should do is to loop through File A, adding all the lines to a set called valid_dictionary. Calling set(FirstFile) can do that in one line, without even calling readlines().

Then a simple loop can build the desired lists. The matched_lines is simply all lines which are in the dictionary, while unmatched_lines are those which are not.

The heart of the comparison could simply look like:

    if line in valid_dictionary:
        matched_lines.append(line)
    else:
        unmatched_lines.append(line)

--
DaveA
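The set-based approach described above could be sketched as follows, in Python 3. The `io.StringIO` objects are stand-ins for the real files so the example runs anywhere; stripping the trailing newline is an added assumption, made because the last line of a file may lack one and would otherwise fail to match.

```python
import io

# Stand-ins for open('FileA.txt') and open('FileB.txt'); contents are made up.
first_file = io.StringIO("spam\nham\neggs\ntomato\n")
second_file = io.StringIO("tomato\nspam\neggs\ncheese\nspam\nspam")

# Iterating over a file yields its lines, so no readlines() call is needed.
valid_dictionary = {line.rstrip("\n") for line in first_file}

matched_lines = []
unmatched_lines = []
for line in second_file:
    line = line.rstrip("\n")
    if line in valid_dictionary:
        matched_lines.append(line)
    else:
        unmatched_lines.append(line)

print(matched_lines)    # ['tomato', 'spam', 'eggs', 'spam', 'spam']
print(unmatched_lines)  # ['cheese']
```

Unlike a diff, every occurrence of "spam" in File B is tested independently, so repeated lines all end up in the matched list.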
Re: [Tutor] how to struct.pack a unicode string?
>> How can I pack a unicode string using the struct module? If I simply use packed = struct.pack(fmt, hello) in the code below (and 'hello' is a unicode string), I get this: "error: argument for 's' must be a string". I keep reading that I have to encode it to a utf-8 bytestring, but this does not work (it yields mojibake and tofu output for some of the languages).
>
> You keep reading it because it is the right approach. You will not get mojibake if you decode the "packed" data before using it.
>
> Your code basically becomes
>
> for greet in greetings:
>     language, chars, encoding = greet
>     hello = "".join([unichr(i) for i in chars])
>     packed = hello.encode("utf-8")
>     unpacked = packed.decode("utf-8")
>     print unpacked
>
> I don't know why you mess with byte order, perhaps you can tell a bit about your actual use-case.

Hi Peter,

Thanks for helping me. I am writing binary files and I wanted to create test data for this.

--this has been a good test case, such that (a) it demonstrated a defect in my program (b) idem, my knowledge. I realize how cp1252-ish I am; for instance, I wrongly tend to assume that len(someUnicodeString) == nbytes_of_that_unicode_string.

--re: messing with byte order: I read in M. Summerfield's "Programming in Python 3" that it's advisable to always specify the byte order, for portability of the data. But, now that you mention it, the way I did it, I might as well omit it. Or, given that the binary format I am writing contains information about the byte order, I might hard-code the byte order (e.g. always write LE). That would follow Mark Summerfield's advice, if I understand it correctly.

--(Aside from your advice to use utf-8) Given that sys.maxunicode == 65535 on my system (i.e., that many unicode points can be represented in my compilation of Python) I'd expect that I not only could write u'blaah'.encode("unicode-internal"), but also u'blaah'.encode("ucs-2"):

    Traceback (most recent call last):
      File "", line 1, in
        u'blaah'.encode("ucs-2")
    LookupError: unknown encoding: ucs-2

Why is the label "unicode-internal" used to indicate both ucs-2 and ucs-4? And why does the same Python version on my Linux computer use 1114111 code points? Can we conclude that Linux users are better equipped to write a letter in Burmese or Aleut? ;-)

Thanks again!

Regards,
Albert-Jan
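The mismatch between len(someUnicodeString) and the number of bytes mentioned above can be seen in a small sketch. It is written for Python 3 (the thread itself uses Python 2), and `byte_lengths` is a hypothetical helper name; the two sample strings are taken from the greetings used in this thread.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
samples = {
    "Swiss German": "Gr\u00fcezi",
    "Thai": "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35",
}


def byte_lengths(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Map each encoding name to the number of bytes it needs for text."""
    return {enc: len(text.encode(enc)) for enc in encodings}


for language, text in samples.items():
    # len(text) counts code points; the dict shows bytes per encoding.
    print(language, len(text), byte_lengths(text))
```

Both strings have six code points, yet the UTF-8 forms need 7 and 18 bytes respectively, so any size passed to struct has to come from the encoded bytes.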
Re: [Tutor] how to struct.pack a unicode string?
> * some encodings are more compact than others (e.g. Latin-1 uses one byte per character, while UTF-32 uses four bytes per character).

I read that performance of UTF32 is better ("UTF-32 advantage: you don't need to decode stored data to the 32-bit Unicode code point for e.g. character by character handling. The code point is already available right there in your array/vector/string."). http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32
But given that utf-32 is a memory hog, should one conclude that it's usually not a good idea to use it (esp. in Python)?

>> but this does not work (it yields mojibake and tofu output for some of the languages).
>
> It would be useful to see an example of this.
>
> But if you do your encoding/decoding correctly, using the right codecs, you should never get mojibake. You only get that when you have a mismatch between the encoding you think you have and the encoding you actually have.
>
>> It's annoying if one needs to know the encoding in which each individual language should be represented. I was hoping "unicode-internal" was the way to do it, but this does not reproduce the original string when I unpack it.. :-(
>
> Yes, encodings are annoying. The sooner that all encodings other than UTF-8 and UTF-32 disappear the better :)

So true ;-)

> The beauty of using UTF-8 instead of one of the many legacy encodings is that UTF-8 can represent any character, so you don't need to care about the individual language, and it is compact (at least for Western European languages).

Later you write "You need a variable-length struct, of course.". Is this because ASCII is a subset of UTF-8? The thing is, the binary format I am writing (spss .sav), uses *fixed* column widths. This means that, even when I only use the ascii subset of utf-8, I still need to assume the worst-case-scenario, namely 3 bytes per symbol, right?

> Why are you using struct for this? If you want to convert Unicode strings into a sequence of bytes, that's exactly what the encode method does. There's no need for struct.

I am using struct to read/write binary data. I created the 'greetings' code to test my program (and my knowledge). As I said to Peter Otten, both were/are imperfect ;-). Struct needs a bytestring, not a unicode string, hence I needed to convert my unicode strings first. I used these languages because I suspected I often get away with errors because 'my' encoding (cp1252) is fairly easy.

> greetings = [
>     ('Arabic',
>      u'\u0627\u0644\u0633\u0644\u0627\u0645\u0020\u0639\u0644\u064a\u0643\u0645',
>      'cp1256'),
>     ('Assamese',
>      u'\u09a8\u09ae\u09b8\u09cd\u0995\u09be\u09f0',
>      'utf-8'),
>     ('Bengali',
>      u'\u0986\u09b8\u09b8\u09be\u09b2\u09be\u09ae\u09c1 \u0986\u09b2\u09be\u0987\u0995\u09c1\u09ae',
>      'utf-8'),
>     ('English', u'Greetings and salutations',
>      'ascii'),
>     ('Georgian',
>      u'\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0',
>      'utf-8'),
>     ('Kazakh',
>      u'\u0421\u04d9\u043b\u0435\u043c\u0435\u0442\u0441\u0456\u0437 \u0431\u0435',
>      'utf-8'),
>     ('Russian',
>      u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435',
>      'utf-8'),
>     ('Spanish', u'\xa1Hola!', 'cp1252'),
>     ('Swiss German', u'Gr\xfcezi', 'cp1252'),
>     ('Thai',
>      u'\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35',
>      'cp874'),
>     ('Walloon', u'Bondjo\xfb', 'cp1252'),
>     ]
>
> for language, greet, encoding in greetings:
>     print u"Hello in %s: %s" % (language, greet)
>     for enc in ('utf-8', 'utf-16', 'utf-32', encoding):
>         bytestring = greet.encode(enc)
>         print "encoded as %s gives %r" % (enc, bytestring)
>         if bytestring.decode(enc) != greet:
>             print "*** round-trip encoding/decoding failed ***"
>
> Any of the byte strings can then be written directly to a file:
>
>     f.write(bytestring)
>
> or embedded into a struct. You need a variable-length struct, of course.

See above. I believe I've got it working for character data already; now I still need to check whether I can also store e.g. Chinese metadata in my spss file.

> My advice: stick to Python unicode strings internally, and always write them to files as UTF-8.

Thanks Steven, I appreciate it!
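For the fixed-width case raised above, a rough Python 3 sketch might look like the following. `pack_fixed` and `unpack_fixed` are hypothetical helper names, and padding with NUL bytes is simply what struct's "s" format does; the real .sav format may pad differently. UTF-8 needs at most 3 bytes for any code point in the Basic Multilingual Plane and 4 bytes beyond it, so the field width has to be budgeted in bytes.

```python
import struct


def pack_fixed(text, width):
    """Encode text as UTF-8 and pack it into exactly `width` bytes."""
    data = text.encode("utf-8")
    if len(data) > width:
        # Refuse rather than let struct silently cut a character in half.
        raise ValueError("%r needs %d bytes, field holds %d"
                         % (text, len(data), width))
    # The "s" format pads short values with NUL bytes up to the count.
    return struct.pack("%ds" % width, data)


def unpack_fixed(packed):
    """Reverse pack_fixed: strip the NUL padding and decode."""
    (data,) = struct.unpack("%ds" % len(packed), packed)
    return data.rstrip(b"\x00").decode("utf-8")


thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"   # 6 code points, 18 UTF-8 bytes
field = pack_fixed(thai, 3 * len(thai))          # worst case for BMP text
print(len(field), unpack_fixed(field) == thai)   # 18 True
```

A width of 6 would be enough for this greeting in cp874 but raises ValueError here, which is the worst-case budgeting the question is about.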
Re: [Tutor] how to struct.pack a unicode string?
> to make is that the transform formats are multibyte encodings (except ASCII in UTF-8), which means the expression str(len(hello)) is using the wrong length; it needs to use the length of the encoded string. Also, UTF-16 and UTF-32 typically have very many null bytes. Together, these two observations explain the error: "'unicode_internal' codec can't decode byte 0x00 in position 12: truncated input".

Hi Eryksun,

Observation #1: Yes, makes perfect sense. I should have thought about that.
Observation #2: As I emailed earlier today to Peter Otten, I thought unicode_internal means UCS-2 or UCS-4, depending on the size of sys.maxunicode? How is this related to UTF-16 and UTF-32?

Thank you!

Best regards,
Albert-Jan
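The point about using the length of the encoded string could be sketched like this in Python 3. The length-prefixed layout and the names `pack_text` and `unpack_text` are assumptions made for illustration; they are not the format used in the original program.

```python
import struct


def pack_text(text, encoding="utf-8"):
    """Pack text as a 4-byte little-endian length followed by the bytes."""
    data = text.encode(encoding)
    # The count in the format is len(data), the number of encoded bytes,
    # not len(text), the number of code points.
    return struct.pack("<I%ds" % len(data), len(data), data)


def unpack_text(packed, encoding="utf-8"):
    """Read the length prefix, then exactly that many bytes."""
    (size,) = struct.unpack_from("<I", packed, 0)
    (data,) = struct.unpack_from("<%ds" % size, packed, 4)
    return data.decode(encoding)


russian = ("\u0417\u0434\u0440\u0430\u0432\u0441"
           "\u0442\u0432\u0443\u0439\u0442\u0435")
packed = pack_text(russian)
print(len(russian), len(packed))       # 12 28
print(unpack_text(packed) == russian)  # True
```

The same round trip works with `encoding="utf-16-le"`, whose output contains the null bytes mentioned above, because the stored size counts bytes rather than characters.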
Re: [Tutor] how to struct.pack a unicode string?
On 12/02/2012 08:34 AM, Albert-Jan Roskam wrote:
> Hi Eryksun,
>
> Observation #1: Yes, makes perfect sense. I should have thought about that.
> Observation #2: As I emailed earlier today to Peter Otten, I thought unicode_internal means UCS-2 or UCS-4, depending on the size of sys.maxunicode? How is this related to UTF-16 and UTF-32?

How is maxunicode relevant? Are you stuck on 3.2 or something? Python 3.3 uses 1, 2 or 4 bytes per character for the internal storage of a string, depending only upon the needs of that particular string.

--
DaveA
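The per-string storage described above can be observed, roughly, with sys.getsizeof. This is a sketch that assumes CPython 3.3 or later; the exact byte counts are an implementation detail and vary between versions, so none are shown.

```python
import sys

# Three strings of the same length with different "widest" code points.
ascii_text = "a" * 1000             # every code point fits in 1 byte
bmp_text = "\u20ac" * 1000          # euro sign, needs 2 bytes each
astral_text = "\U0001F40D" * 1000   # outside the BMP, needs 4 bytes each

for label, text in [("ascii", ascii_text),
                    ("bmp", bmp_text),
                    ("astral", astral_text)]:
    # len() is identical for all three; only the memory footprint grows.
    print(label, len(text), sys.getsizeof(text))
```

On such a build the three reported sizes increase in that order, while len() stays at 1000 for each string.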
Re: [Tutor] FW: (no subject)
Luke,

Thanks. The generator syntax is really cool.

--
Ashfaq
[Tutor] Help with writing a program
Hello, I am trying to write a program which takes two lines of input, one called "a", and one called "b", which are both strings, then outputs the number of times a is a substring of b. If you could give me an algorithm/pseudo code of what I should do to create this program, I would greatly appreciate that. Thank you for using your time to consider my request.
Re: [Tutor] Help with writing a program
On 03/12/2012 03:59, rajesh mullings wrote:

> Hello, I am trying to write a program which takes two lines of input, one called "a", and one called "b", which are both strings, then outputs the number of times a is a substring of b. If you could give me an algorithm/pseudo code of what I should do to create this program, I would greatly appreciate that. Thank you for using your time to consider my request.

Start here http://docs.python.org/2/library/string.html

--
Cheers.

Mark Lawrence.
Re: [Tutor] reverse diagonal
On Sun, Dec 2, 2012 at 2:32 AM, Steven D'Aprano wrote:
>
>> ~i returns the value (-i - 1):
>
> Assuming certain implementation details about how integers are stored, namely that they are two's-complement rather than ones'-complement or something more exotic.

Yes, the result is platform dependent, at least for the 2.x int type. I saw it in someone else's code or blog a while ago and thought I'd pass it along as a novelty and something to keep an eye out for.

A multiprecision long might qualify as exotic. It uses sign-magnitude form. The sign of the number and the length of ob_digit are both stored in ob_size. For the invert op, it adds 1 and negates the sign to emulate 2's complement:

http://hg.python.org/cpython/file/70274d53c1dd/Objects/longobject.c#l3566

Further along is more 2's complement emulation for bitwise &, |, and ^:

http://hg.python.org/cpython/file/70274d53c1dd/Objects/longobject.c#l3743

> Okay, just about every computer made since 1960 uses two's-complement integers, but still, the effect of ~i depends on the way integers are represented internally rather than some property of integers as an abstract number. That makes it a code smell.

It relies on integer modulo arithmetic. The internal base is arbitrary and not apparent. It could be 10s complement on some hypothetical base 10 computer.

In terms of a Python sequence, you could use unsigned indices such as [0,1,2,3,4,5,6,7] or the N=8 complement indices [0,1,2,3,-4,-3,-2,-1], where -1 % 8 == 7, and so on. The invert op can be generalized as N-1-i for any N-length window on the integers (e.g. 5-digit base 10, where N=10**5, subtract i from N-1 == 99999), which just inverts the sequence order. The interpretation of this as a negative number depends on a signed type that represents negative values as modulo N. That's common because it's a simple shift of the window to be symmetric about 0 (well, almost symmetric for even N); the modulo arithmetic is easy and there's no negative 0. However, with a multiprecision integer type, it's simpler to use a sign-magnitude representation.

That said, I don't want to give the impression that I disagree with you. You're right that it isn't generally advisable to use a single operation instead of two or three if it sacrifices clarity and portability. It didn't jump out at me as a problem since I take 2s complement for granted and have a bias to favor symmetry and minimalism.
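As a small illustration of the identities discussed above, here is a sketch in Python 3. The function name `reverse_diagonal` and the sample data are made up for the example; the thread's original code is not shown in this digest.

```python
def reverse_diagonal(matrix):
    """Return the anti-diagonal of a square matrix, top-right first."""
    # row[~i] is row[-i - 1], i.e. the i-th element counted from the end.
    return [row[~i] for i, row in enumerate(matrix)]


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(reverse_diagonal(grid))  # [3, 5, 7]

items = "abcdefgh"
n = len(items)
for i in range(n):
    assert ~i == -i - 1                   # the invert identity
    assert items[~i] == items[n - 1 - i]  # same element as index N-1-i
    assert ~i % n == n - 1 - i            # the modulo-N view of the index
```

Writing `row[-i - 1]` or `row[n - 1 - i]` gives the same result and may read more clearly, which is the trade-off the two posters are weighing.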
Re: [Tutor] Help with writing a program
The Python version is Python 3.

On Sun, Dec 2, 2012 at 10:59 PM, rajesh mullings wrote:
> Hello, I am trying to write a program which takes two lines of input, one called "a", and one called "b", which are both strings, then outputs the number of times a is a substring of b. If you could give me an algorithm/pseudo code of what I should do to create this program, I would greatly appreciate that. Thank you for using your time to consider my request.
Re: [Tutor] how to struct.pack a unicode string?
On Sun, Dec 2, 2012 at 8:34 AM, Albert-Jan Roskam wrote:
>
> As I emailed earlier today to Peter Otten, I thought unicode_internal means UCS-2 or UCS-4, depending on the size of sys.maxunicode? How is this related to UTF-16 and UTF-32?

UCS is the universal character set. Some highlights of the Basic Multilingual Plane (BMP):

U+0000-U+00FF is Latin-1 (including the C0 and C1 control codes).
U+D800-U+DFFF is reserved for UTF-16 surrogate pairs.
U+E000-U+F8FF is reserved for private use.
Most of U+F900-U+FFFF is assigned. Notably U+FEFF (zero width no-break space) doubles as the BOM/signature in the transformation formats.

UTF-16 encodes the supplementary planes by using 2 codes as a surrogate pair. This uses a reserved 11-bit block (U+D800-U+DFFF), which is split into two 10-bit ranges: U+D800-U+DBFF for the lead surrogate and U+DC00-U+DFFF for the trail surrogate. Together that's the required 20 bits for the 16 supplementary planes. Including the BMP, this scheme covers the complete UCS range of 17 * 2**16 == 1114112 codes (on a wide build, that's sys.maxunicode + 1).

For encoding text, use one of the transformation formats such as UTF-8, UTF-16, or UTF-32. Unless you have a requirement to use UTF-16 or UTF-32, it's best to stick to encoding to UTF-8. It's the default encoding in 3.x. It's also generally the most compact representation (especially if there's a lot of ASCII) and compatible with null-terminated byte strings (i.e. C array of char, terminated by NUL).

Regardless of narrow vs wide build, you can always encode to one of these formats. The encoders for UTF-8 and UTF-32 first recombine any surrogate pairs in the internal representation.

CPython 3.3 has a new implementation that angles for the best of all worlds, opting for a 1-byte, 2-byte, or 4-byte representation depending on the maximum code in the string. The internal representation doesn't use surrogates, so there's no more narrow vs wide build distinction.
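The surrogate-pair arithmetic described above can be worked through in a short Python 3 sketch. `to_surrogates` is a hypothetical helper name, and U+1F40D is just an arbitrary code point from a supplementary plane.

```python
import struct


def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not in a supplementary plane: %#x" % code_point)
    offset = code_point - 0x10000      # a 20-bit value
    lead = 0xD800 + (offset >> 10)     # top 10 bits, U+D800-U+DBFF
    trail = 0xDC00 + (offset & 0x3FF)  # low 10 bits, U+DC00-U+DFFF
    return lead, trail


lead, trail = to_surrogates(0x1F40D)
print(hex(lead), hex(trail))  # 0xd83d 0xdc0d

# The UTF-16 codec produces the same two 16-bit code units.
assert chr(0x1F40D).encode("utf-16-be") == struct.pack(">HH", lead, trail)

# 17 planes of 2**16 codes each give the full range 0..0x10FFFF.
assert 17 * 2**16 == 0x10FFFF + 1 == 1114112
```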
Re: [Tutor] FW: (no subject)
On Sun, Dec 2, 2012 at 8:41 PM, Ashfaq wrote:
> Luke,
>
> Thanks. The generator syntax is really cool.

I misspoke, the correct term is "list comprehension". A generator is something totally different! Sorry about the confusion, my fault. I type too fast sometimes :)

Glad you liked it though.

-Luke
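To make the distinction above concrete, here is a small Python 3 sketch. The code Luke originally posted is not part of this digest, so the data and names below are purely illustrative.

```python
numbers = [1, 2, 3, 4]

# List comprehension: builds the complete list immediately.
squares_list = [n * n for n in numbers]


def squares(values):
    """Generator function: produces one value at a time, on demand."""
    for n in values:
        yield n * n


squares_gen = squares(numbers)

print(squares_list)       # [1, 4, 9, 16]
print(list(squares_gen))  # [1, 4, 9, 16]
print(list(squares_gen))  # []  (a generator can only be consumed once)
```

The list can be indexed and iterated repeatedly; the generator holds no stored results, which is why the second pass over it comes back empty.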
Re: [Tutor] Help with writing a program
There is an equivalent page in the documentation for Python 3 as well, regarding strings.

This sounds a lot like a homework problem so you are unlikely to get a lot of help. You certainly won't get exact code.

What have you tried so far? Where are you getting stuck? We're not here to write code for you, this list is meant to help you learn something yourself. If you just want someone to write code for you there are plenty of sites that will do that. But if you want to figure it out I'd be happy to give you some hints if I can see that you're making some effort. One effort you could make would be to find the relevant Python 3 document discussing strings and check if it has some references to finding substrings.

Let me know what you try and I'll help you if you get stuck.

Thanks,
-Luke

On Sun, Dec 2, 2012 at 11:31 PM, fantasticrm wrote:
> The Python version, is Python 3.
>
> On Sun, Dec 2, 2012 at 10:59 PM, rajesh mullings wrote:
>
>> Hello, I am trying to write a program which takes two lines of input, one called "a", and one called "b", which are both strings, then outputs the number of times a is a substring of b. If you could give me an algorithm/pseudo code of what I should do to create this program, I would greatly appreciate that. Thank you for using your time to consider my request.