Why isn't my re.sub replacing the contents of my MS Word file?
Hi, here is a snippet of code that opens a file (fn contains the path\name) and first tries to replace all en dash, em dash, etc. characters with simple dash characters before doing a search. But the replacements are not having any effect. Obviously a syntax problem. What silly thing am I doing wrong? Thanks!

    fn = 'z:\Documentation\Software'
    def processdoc(fn,outfile):
        fStr = open(fn, 'rb').read()
        re.sub(b'‒','-',fStr)
        re.sub(b'–','-',fStr)
        re.sub(b'—','-',fStr)
        re.sub(b'―','-',fStr)
        re.sub(b'⸺','-',fStr)
        re.sub(b'⸻','-',fStr)
        re.sub(b'-','-',fStr)
        re.sub(b'','-',fStr)

--
https://mail.python.org/mailman/listinfo/python-list
Re: Why isn't my re.sub replacing the contents of my MS Word file?
> > re.sub _returns_ its result (strings are immutable).

Ahh, so I tried this for each re.sub:

    fStr = re.sub(b'‒','-',fStr)

No errors running it, but it still does nothing.
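
The point about return values can be illustrated with a minimal sketch. The sample bytes below are hypothetical stand-ins for the file contents, and an underscore is used instead of a dash character to keep the example independent of any encoding question.

```python
import re

# Hypothetical sample data standing in for open(fn, 'rb').read().
data = b"a_b_c"

# bytes objects are immutable: this call builds a new object and the
# result is thrown away, so `data` is exactly what it was before.
re.sub(b"_", b"-", data)
unchanged = data

# Binding the return value to a name keeps the substituted copy.
fixed = re.sub(b"_", b"-", data)

print(unchanged)  # b'a_b_c'
print(fixed)      # b'a-b-c'
```

Note that both the pattern and the replacement are bytes here, which matches the type of the data being searched.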
Re: Why isn't my re.sub replacing the contents of my MS Word file?
On Friday, May 9, 2014 4:09:58 PM UTC-4, Tim Chase wrote:
> A Word doc (as your subject mentions) is a binary format. There's
> the older .doc and the newer .docx (which is actually a .zip file
> with a particular content-structure renamed to .docx).

I am using .doc files only.

> For the older .doc file, it's a binary format, so even if you can
> successfully find & swap out sequences of 7 chars for a single char,
> it might screw up the internal offsets, breaking your file.

I do not save the file out again. I only try to change all en dashes and em dashes to plain dashes, then search and print things to another file, closing the searched file without writing it.

> Additionally, I vaguely remember sparring with them using some 16-bit
> wide characters in .doc files so you might have to search for
> atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
> character being prefixed with "\x00").

Hmm, I thought that was what I was doing. Can anyone figure out why the syntax is wrong for Word 2007 document binary file data?
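
The remark about 16-bit wide characters can be made concrete with a small sketch. It assumes the text run is stored as little-endian UTF-16, which is an assumption for illustration and not something the thread confirms for this particular file; the sample text is hypothetical.

```python
# Hypothetical text; a real .doc file wraps its text runs in binary structure.
text = "Part 12\u201334"            # contains U+2013 EN DASH
raw = text.encode("utf-16-le")      # two bytes per character for this text

# Encode the search and replacement strings the same way as the data.
en_dash = "\u2013".encode("utf-16-le")   # b'\x13 ' (0x13 followed by 0x20)
hyphen = "-".encode("utf-16-le")         # b'-\x00'

replaced = raw.replace(en_dash, hyphen)
result = replaced.decode("utf-16-le")

print(en_dash)  # b'\x13 '
print(result)   # Part 12-34
```

A byte-level search like this can in principle also match a pair of bytes that straddles two adjacent characters, so decoding first and replacing on the resulting str is the safer route when the encoding is known.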
Re: Why isn't my re.sub replacing the contents of my MS Word file?
On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
> Good:
>
>     fStr = re.sub(b'‒', b'-', fStr)

Doesn't work. The document has been verified to contain en dash and em dash characters, but this does NOT replace them.

> Better:
>
>     fStr = fStr.replace(b'‒', b'-')

Still doesn't work.

> But having said that, you actually can make use of the nuclear-powered
> bulldozer, and do all the replacements in one go:
>
> Best:
>
>     # Untested
>     fStr = re.sub(b'(201[2-5])|(2E3[AB])|(00[2A]D)', b'-', fStr)

Still doesn't work. I guess whatever the codes for en dash and em dash are, they are not the ones I am using.
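
The idea of doing all the replacements in one go can be sketched on decoded text rather than raw bytes. This assumes the document text has already been extracted into a str, which the thread does not show. The code points are the ones named in the quoted pattern, and normalize_dashes is a hypothetical helper name.

```python
import re

# One character class covering U+2012..U+2015, U+2E3A, U+2E3B and the
# soft hyphen U+00AD. The \u escapes are resolved by Python before the
# pattern reaches the regex engine.
DASHES = re.compile("[\u2012-\u2015\u2e3a\u2e3b\u00ad]")

def normalize_dashes(text: str) -> str:
    """Return text with every dash-like character replaced by '-'."""
    return DASHES.sub("-", text)

# Hypothetical sample: en dash, em dash, two-em dash.
sample = "a\u2013b\u2014c\u2e3ad"
normalized = normalize_dashes(sample)
print(normalized)  # a-b-c-d
```

This only helps once the bytes have been decoded correctly; applied to undecoded file contents, a str pattern like this would not match anything useful.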
Re: Why isn't my re.sub replacing the contents of my MS Word file?
On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
> You may have missed my follow up post, where I said I had not noticed you
> were operating on a binary .doc file.
>
> If you're not willing or able to use a full-blown doc parser, say by
> controlling Word or LibreOffice, the other alternative is to do something
> quick and dirty that might work most of the time. Open a doc file, or
> multiple doc files, in a hex editor and *hopefully* you will be able to
> see chunks of human-readable text where you can identify how en-dashes
> and similar are stored.

I created a .doc file and opened it with UltraEdit in binary (hex) mode. What I see is that there are two characters, one for en dash and one for em dash, each a single byte long: 0x96 and 0x97. So I tried this:

    fStr = re.sub(b'\0x96',b'-',fStr)

That did nothing in my file. So I tried this:

    fStr = re.sub(b'0x97',b'-',fStr)

which also did nothing. So, for fun, I also tried to just put these wildcards in my re.findall, so I added

    |Part \0x96|Part \0x97

to no avail. Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or re.sub as hex byte values of 96 and 97 hexadecimal using my current syntax.

So here's my question: if I want to replace all en dash or em dash values with regular '-' symbols using re.sub, what is the proper syntax to do so? Thanks!
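
One way to see why neither attempt matches is to look at the bytes each literal actually contains. This is only a diagnostic sketch; the variable names are made up for illustration.

```python
# \0 is an octal escape for a NUL byte, followed by the characters x, 9, 6.
attempt_1 = b"\0x96"
# No escape at all: four ordinary ASCII characters.
attempt_2 = b"0x97"
# A single byte whose value is 0x96, built from an int for comparison.
single = bytes([0x96])

print(list(attempt_1))  # [0, 120, 57, 54]
print(list(attempt_2))  # [48, 120, 57, 55]
print(list(single))     # [150]
```

Both attempts are four bytes long, so they can only match a literal four-byte sequence in the file, never the single byte seen in the hex editor.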
Re: Why isn't my re.sub replacing the contents of my MS Word file?
On Tuesday, May 13, 2014 4:26:51 PM UTC-4, MRAB wrote:
> > 0x96 is a hexadecimal literal for an int. Within a string you need \x96
> > (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).

Yes, that was my problem. I figured it out just after posting my last message. Using \x96 works correctly. Thanks!
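
A short sketch of the working form, using hypothetical sample bytes instead of a real .doc file. It also shows that 0x96 and 0x97 are the Windows-1252 (cp1252) encodings of the en dash and em dash, and that \u and \U are escapes for str literals only, while \x works in both str and bytes literals.

```python
import re

# Hypothetical sample bytes; \x takes exactly two hex digits, so
# b"12\x9634" is '1', '2', the byte 0x96, '3', '4'.
data = b"Part 12\x9634 and Part 56\x9778"

# Both single-byte dashes handled by one character class.
fixed = re.sub(b"[\x96\x97]", b"-", data)
print(fixed)  # b'Part 12-34 and Part 56-78'

# In cp1252 these two bytes decode to U+2013 and U+2014.
en_dash = b"\x96".decode("cp1252")
em_dash = b"\x97".decode("cp1252")
print(en_dash == "\u2013", em_dash == "\u2014")  # True True

# \u (4 hex digits) and \U (8 hex digits) name the same code point.
same = "\u2013" == "\U00002013"
print(same)  # True
```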
