This is OT, but I thought I'd start with this list as it is the list that I 
deal with more than any other.  If no one here can help, suggestions for a 
better list to try will be appreciated.

I've never used Perl, but I'm hoping Perl can do the job for me.

What I need to do:

I have multiple large files (one example is 5.4 MB).  Each is essentially a data 
dump from a database--I have no control over the database or the format of the 
dump.

The file is ugly, with lots of extraneous characters--I want to run a series of 
regular expression search and replace commands over the file to clean it up.

Some of the things that may make it tough:

   * There are essentially no line breaks (0Ah or 0Dh)--the file is one long 
5.4 MB line (well, there are 4 line breaks for some short lines at the 
beginning of the file, maybe somewhere between 32 and 80 characters on each of 
those 4 lines).

   * The file can, and often will have UTF-8 characters in it (iiuc--the file 
contains URLs, some of which, I'm sure, can include UTF-8 characters, or maybe 
some other encoding??).  The search and replace doesn't particularly have to 
handle the UTF-8 search terms (because the keywords and punctuation I will 
search on will be plain ASCII), but any UTF-8 characters have to remain 
"intact" after the search and replace.

I'm hoping that I can write a Perl script that may be something like this:

Code to open a file (which I will need to learn / find)

Multiple statements of the form "s/<search regular expression>/<replace 
regular expression>/g"

(Aside: the replacement probably doesn't have to be a regular expression, but 
it will need to include things like line break characters (\n).)
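
To make the outline above concrete, here is a rough sketch of the same shape, written in Python only for illustration (each rule plays the role of one "s/.../.../g" statement).  The three rules, the "||" separator, and the names RULES, clean and main are made up for the example; the real patterns depend on what the dump actually contains.  The sketch assumes the whole file fits in memory (5.4 MB should) and that every search pattern is plain ASCII, so working on raw bytes leaves any UTF-8 sequences untouched.

```python
import re
import sys

# Hypothetical (pattern, replacement) pairs; the real ones depend on the dump.
# Patterns are bytes and ASCII-only, so they cannot match inside a multi-byte
# UTF-8 sequence (every byte of such a sequence is 0x80 or higher).
RULES = [
    (re.compile(rb"\|\|"), b"\n"),      # turn a "||" separator into a line break
    (re.compile(rb"[ \t]+\n"), b"\n"),  # drop blanks left just before a line break
    (re.compile(rb"\x00+"), b""),       # remove runs of NUL bytes
]


def clean(data: bytes) -> bytes:
    """Apply every rule in order to the whole buffer, like a chain of s/.../.../g."""
    for pattern, replacement in RULES:
        data = pattern.sub(replacement, data)
    return data


def main(src: str, dst: str) -> None:
    # Binary mode: the file is read as a single buffer, so the length of a
    # "line" does not matter and nothing is decoded, re-encoded or re-wrapped.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(clean(data))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: clean_dump.py INPUT OUTPUT", file=sys.stderr)
```

Saved as, say, clean_dump.py (a made-up name), it would be run as "python3 clean_dump.py dump.txt dump.clean.txt".  The only line breaks in the output are those already in the input plus those the rules insert.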

I did try to do this with one of the editors I use (I started with Kate), but 
Kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes / 
characters (at inconvenient places), and, although I got the job (almost) 
done, it required a lot of manual intervention / correction, so I want to 
automate it with a tool that can work on very long lines without inserting 
line breaks (other than those I require).

If some simpler tool can do the job, I'll consider that as well (I have 
occasionally used awk, and maybe sed, though I don't think sed ever proved 
useful for me).

Any help appreciated.
