Why isn't my re.sub replacing the contents of my MS Word file?

2014-05-09 Thread scottcabit
Hi,

 Here is a snippet of code that opens a file (fn contains the path\name) and 
first tries to replace all en-dash, em-dash, etc. characters with simple dash 
characters before doing a search.
  But the replacements are not having any effect. Obviously a syntax 
problem. What silly thing am I doing wrong?

  Thanks!

fn = 'z:\Documentation\Software'
def processdoc(fn,outfile):
    fStr = open(fn, 'rb').read()
    re.sub(b'&#x2012','-',fStr)
    re.sub(b'&#x2013','-',fStr)
    re.sub(b'&#x2014','-',fStr)
    re.sub(b'&#x2015','-',fStr)
    re.sub(b'&#x2E3A','-',fStr)
    re.sub(b'&#x2E3B','-',fStr)
    re.sub(b'&#x002D','-',fStr)
    re.sub(b'&#x00AD','-',fStr)

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Why isn't my re.sub replacing the contents of my MS Word file?

2014-05-09 Thread scottcabit
> 
> re.sub _returns_ its result (strings are immutable).

  Ahh...so I tried this for each re.sub:

  fStr = re.sub(b'&#x2012','-',fStr)

  No errors running it, but it still does nothing.
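
A minimal sketch of the point quoted above, assuming Python 3 and a made-up sample value: re.sub never modifies its argument, so the result has to be bound to a name. The variable names here are illustrative only. This explains the first symptom; it does not by itself explain why the pattern finds nothing in the .doc file.

```python
import re

data = b"2010--2014"

# The return value is thrown away here, so `data` is not modified.
re.sub(b"--", b"-", data)
after_discard = data

# Binding the result to a name keeps the substituted copy.
after_assign = re.sub(b"--", b"-", data)

print(after_discard)  # b'2010--2014'
print(after_assign)   # b'2010-2014'
```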


Re: Why isn't my re.sub replacing the contents of my MS Word file?

2014-05-09 Thread scottcabit
On Friday, May 9, 2014 4:09:58 PM UTC-4, Tim Chase wrote:

> A Word doc (as your subject mentions) is a binary format.  There's
> the older .doc and the newer .docx (which is actually a .zip file
> with a particular content-structure renamed to .docx).
> 
   I am using .doc files only..

> 
> For the older .doc file, it's a binary format, so even if you can
> successfully find & swap out sequences of 7 chars for a single char,
> it might screw up the internal offsets, breaking your file.

   I do not save the file out again, only try to change all en-dash and em-dash 
to dashes, then search and print things to another file, closing the searched 
file without writing it.

> 
> Additionally, I vaguely remember sparring with them using some 16-bit
> wide characters in .doc files so you might have to search for
> atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
> character being prefixed with "\x00").

  Hmmm..thought that was what I was doing. Can anyone figure out why the syntax 
is wrong for Word 2007 document binary file data?
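
As a rough illustration of the 16-bit remark quoted above (assuming Python 3): encoding the seven-character text as UTF-16 produces the zero-padded byte string Tim describes. Which byte order a given .doc file uses, and whether it contains this text at all, is an assumption to check in a hex editor, not something this sketch establishes.

```python
needle = "&#x2012"

# Big-endian UTF-16 puts the zero byte before each ASCII character,
# which is the form shown in the quoted message.
wide_be = needle.encode("utf-16-be")

# Little-endian UTF-16 puts the zero byte after each character instead.
wide_le = needle.encode("utf-16-le")

print(wide_be)
print(wide_le)
```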



Re: Why isn't my re.sub replacing the contents of my MS Word file?

2014-05-12 Thread scottcabit
On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:

> Good:
> 
> fStr = re.sub(b'&#x2012', b'-', fStr)

  Doesn't work...the document has been verified to contain endash and emdash 
characters, but this does NOT replace them.

> Better:
> 
> fStr = fStr.replace(b'&#x2012', b'-')

   Still doesn't work.

> But having said that, you actually can make use of the nuclear-powered 
> bulldozer, and do all the replacements in one go:
> 
> Best:
> 
> # Untested
> fStr = re.sub(b'&#x(201[2-5])|(2E3[AB])|(00[2A]D)', b'-', fStr)

  Still doesn't work.

  I guess the codes for en-dash and em-dash in the file are not the ones I am 
searching for.
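
A side note on the one-pass pattern quoted above, which was marked untested: alternation binds more loosely than concatenation, so the &#x prefix applies only to the first alternative. A sketch of a grouped variant follows, assuming Python 3; the names DASH_ENTITIES and sample are hypothetical. It only helps if the data really contains the literal text &#x2013 and so on, which does not appear to be the case for this file.

```python
import re

# Non-capturing group so that the "&#x" prefix applies to every alternative.
DASH_ENTITIES = re.compile(rb"&#x(?:201[2-5]|2E3[AB]|00[2A]D)")

sample = b"a&#x2013b&#x2E3Ac&#x00ADd 2E3A"

grouped = DASH_ENTITIES.sub(b"-", sample)

# The ungrouped form also matches a bare "2E3A" or "00AD" with no prefix.
ungrouped = re.sub(rb"&#x(201[2-5])|(2E3[AB])|(00[2A]D)", b"-", sample)

print(grouped)
print(ungrouped)
```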



Re: Why isn't my re.sub replacing the contents of my MS Word file?

2014-05-13 Thread scottcabit
On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
> 
> You may have missed my follow up post, where I said I had not noticed you 
> were operating on a binary .doc file.
> 
> If you're not willing or able to use a full-blown doc parser, say by 
> controlling Word or LibreOffice, the other alternative is to do something 
> quick and dirty that might work most of the time. Open a doc file, or 
> multiple doc files, in a hex editor and *hopefully* you will be able to 
> see chunks of human-readable text where you can identify how en-dashes 
> and similar are stored.
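
If no hex editor is at hand, a rough equivalent of the inspection suggested above can be sketched in Python 3. The helper name high_bytes and the sample bytes are made up for illustration. A real .doc file also holds binary structures, so many of the high byte values reported for it would be unrelated to the text.

```python
from collections import Counter


def high_bytes(data):
    """Count each byte value >= 0x80 that occurs in data."""
    return Counter(b for b in data if b >= 0x80)


# Stand-in for the contents of a file opened with open(fn, 'rb').read()
sample = b"Part 1 \x96 intro, Part 2 \x97 body, Part 3 \x96 end"

counts = high_bytes(sample)
for value, n in sorted(counts.items()):
    print(f"0x{value:02X} appears {n} time(s)")
```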

  I created a .doc file and opened it with UltraEdit in binary (hex) mode. What 
I see is that there are two characters, one for en-dash and one for em-dash, each a 
single byte long: 0x96 and 0x97.
  So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)

  that did nothing in my file. So I tried this: fStr = re.sub(b'0x97',b'-',fStr)

  which also did nothing.
  So, for fun, I also tried putting these into the pattern for my re.findall, 
adding |Part \0x96|Part \0x97 ... to no avail.

  Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or re.sub 
as hex byte values of 96 and 97 hexadecimal using my current syntax.

  So here's my question...if I want to replace all en-dash or em-dash bytes with 
regular '-' symbols using re.sub, what is the proper syntax to do so?

  Thanks!


Re: Why isn't my re.sub replacing the contents of my MS Word file?

2014-05-14 Thread scottcabit
On Tuesday, May 13, 2014 4:26:51 PM UTC-4, MRAB wrote:
> 
> 0x96 is a hexadecimal literal for an int. Within a string you need \x96
> (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).

  Yes, that was my problem. I figured it out just after posting my last message. 
Using \x96 works correctly. Thanks!
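
For reference, a short sketch of why the earlier attempts matched nothing, assuming Python 3 and a made-up sample string. The two mistyped literals are four bytes long each, while \x96 is a single byte. The cp1252 decode at the end rests on the assumption that this file stores its text in Windows-1252, where 0x96 and 0x97 are the en dash and em dash.

```python
import re

wrong_a = b"\0x96"  # NUL byte followed by the ASCII characters "x96"
wrong_b = b"0x97"   # four ASCII characters: "0", "x", "9", "7"
right = b"\x96"     # one byte with the value 0x96

print(len(wrong_a), len(wrong_b), len(right))

text = b"Part 1 \x96 intro \x97 end"

# A character class handles both dash bytes in one pass.
cleaned = re.sub(b"[\x96\x97]", b"-", text)
print(cleaned)

# Assumption: the text is Windows-1252 encoded.
decoded = text.decode("cp1252")
```

For a fixed set of single bytes, bytes.replace would work as well; re.sub is kept here only because the thread uses it.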

