-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Jan 24, 2018 at 02:14:42PM -0500, rhkra...@gmail.com wrote:
> This is OT, but I thought I'd start with this list as it is the list that I 
> deal with more than any other.  If no one here can help, suggestions for a 
> better list to try will be appreciated.
> 
> I've never used Perl, but I'm hoping Perl can do the job for me.
> 
> What I need to do:
> 
> I have multiple large files (one example is 5.4 MB).  It is essentially a 
> data dump from a database--I have no control over the database or the 
> format of the dump.

Perl won't have a problem with a 5.4MB long line.

[...]

>    * The file can, and often will have UTF-8 characters in it (iiuc--the file 
> contains URLs, some of which, I'm sure, can include UTF-8 characters, or 
> maybe some other encoding??).  The search and replace doesn't particularly 
> have to handle the UTF-8 search terms (because the keywords and punctuation 
> I will search on will be plain ASCII), but any UTF-8 characters have to 
> remain "intact" after the search and replace.

Now that's one thing: does the file just contain some UTF-8 characters,
or is the whole file valid UTF-8? This is important to know, because then
you can decide whether to treat it as UTF-8 (then regexps will work on
characters) or as a byte stream (then you'll "see" each multi-byte UTF-8
sequence as several separate bytes: there be dragons).

You can check that with

  iconv -f UTF-8 -t UTF-8 < your_file > /dev/null

or something similar. iconv complains and exits with a non-zero status
if it hits an invalid sequence.
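
If you'd rather do the same check from Perl, here is a minimal sketch using
the core Encode module. The helper name is_valid_utf8 is made up for this
example, and the two test strings are just illustrations:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on the first malformed sequence;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# "caf\xC3\xA9" is well-formed UTF-8; "caf\xE9" is a lone Latin-1 byte.
print is_valid_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";
print is_valid_utf8("caf\xE9")     ? "valid\n" : "invalid\n";
```

For a real file you would read the raw bytes first (open with "<:raw") and
pass them to the helper.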

> I'm hoping that I can write a Perl script that may be something like this:
> 
> Code to open a file (which I will need to learn / find)

  open(my $fh, "<:encoding(UTF-8)", "your_file") or die "Cannot open your_file: $!";

(the whole kaboodle in "perldoc -f open").


> Multiple statements of the form "s/<search regular expression>/<replace 
> regular expression>/g"

If you set $/ (the input record separator) to undef, you can slurp the
whole file into one variable, like so:

  $/ = undef;
  my $data = <$fh>;

(that narrative is in 'man perlvar', for the special variables).
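
Putting the open and the slurp together, a sketch might look like this.
The helper name slurp_utf8 and the throwaway demo file are assumptions for
the example. Note that it uses "local $/" instead of a plain assignment, so
the separator is only undefined inside the helper:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the whole file as one decoded string.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, "<:encoding(UTF-8)", $path) or die "Cannot open $path: $!";
    local $/;            # undef only inside this sub; restored on return
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Demo on a throwaway file so the sketch runs on its own.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ":encoding(UTF-8)");
print {$tmp_fh} "first line\nsecond line with \x{e9}\n";
close($tmp_fh);

my $data = slurp_utf8($tmp_path);
printf "%d characters, %d newlines\n", length($data), ($data =~ tr/\n//);
```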

> (Aside, the replace probably doesn't have to be a regular expression, it will 
> need to include things like line break characters (\n).)

The replace string isn't a regexp anyway (doesn't make sense :) -- it's just
a normal string, possibly with placeholders for parenthesized submatches from
the regexp (if that's mumbo jumbo for you, just ask). "\n" is just a normal
character, as is "\t", etc.
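
To make the placeholder business concrete, here is a small sketch. The
input string and the pattern are invented for the example:

```perl
use strict;
use warnings;

my $data = "id=1;name=tom;id=2;name=jerry;";

# Each (...) in the pattern captures a submatch; $1 and $2 in the
# replacement refer to them. "\n" in the replacement is a real newline.
(my $out = $data) =~ s/id=(\d+);name=(\w+);/$1 = $2\n/g;

print $out;
```

Each "id=...;name=...;" pair ends up on a line of its own, as "1 = tom" and
"2 = jerry".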

Since the whole ugly string will contain newlines, don't forget the /s
modifier if your pattern uses ".": it tells the regexp machine to let "."
match newlines like every other character (without a "." in the pattern
it is harmless), like so:

  $data =~ s/tom/jerry/gs;

(the whole story is in 'man perlre').
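
A small sketch of where /s actually makes a difference, namely when the
pattern contains a "." that has to cross a newline. The BEGIN/END data is
made up for the example:

```perl
use strict;
use warnings;

my $data = "BEGIN\nline one\nline two\nEND\n";

# Without /s, "." stops at a newline, so this pattern cannot span lines.
(my $without_s = $data) =~ s/BEGIN.*?END/X/g;

# With /s, "." also matches "\n", so the whole block is replaced.
(my $with_s = $data) =~ s/BEGIN.*?END/X/gs;

print $without_s eq $data ? "without /s: unchanged\n" : "without /s: changed\n";
print $with_s eq "X\n"    ? "with /s: replaced\n"     : "with /s: not replaced\n";
```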

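At the end, you just print that:

  open(my $outfh, ">:encoding(UTF-8)", "your_output_file")
    or die "Cannot open your_output_file: $!";
  print $outfh $data;
  close($outfh) or die "Cannot close your_output_file: $!";

(no comma between the filehandle $outfh and $data)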
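
Here is the whole thing as one sketch: read, substitute, write. The helper
name rewrite_file, the two placeholder rules and the demo file names are all
assumptions; put your real substitutions in their place:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read $in as UTF-8, apply substitutions, write $out.
sub rewrite_file {
    my ($in, $out) = @_;

    open(my $fh, "<:encoding(UTF-8)", $in) or die "Cannot open $in: $!";
    my $data = do { local $/; <$fh> };   # slurp the whole file
    close($fh);

    # Placeholder rules; replace them with the real ones.
    $data =~ s/tom/jerry/g;
    $data =~ s/;/;\n/g;

    open(my $outfh, ">:encoding(UTF-8)", $out) or die "Cannot open $out: $!";
    print {$outfh} $data;
    close($outfh) or die "Cannot close $out: $!";
    return;
}

# Demo with throwaway files so the sketch runs on its own.
my $dir = tempdir(CLEANUP => 1);
my ($in, $out) = ("$dir/in.txt", "$dir/out.txt");
open(my $w, ">:encoding(UTF-8)", $in) or die "Cannot open $in: $!";
print {$w} "name=tom;url=http://example.org/caf\x{e9};";
close($w) or die "Cannot close $in: $!";

rewrite_file($in, $out);
```

The non-ASCII character in the URL goes through untouched, because both
filehandles decode and encode UTF-8.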

> I did try to do this with one of the editors I use (I started with Kate), but 
> kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes / 
> characters (at inconvenient places) [...]

Yikes. Neither vim nor Emacs will do that to you (although Emacs gets a bit
sluggish on MB-long lines). I'd put such an editor in the recycle bin (sorry).

[...]

> If some simpler tool can do the job, I'll consider that as well (I have 
> occasionally used awk, and maybe sed (I don't think sed ever proved useful 
> for me)).

Sed is actually pretty nifty, but takes some getting used to.

> Any help appreciated.

I hope that gets you started. Just ask.

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlpo6r8ACgkQBcgs9XrR2kZ/oQCfdeDP0dugi4wFQZmjPc9FhIgz
ltEAn1Wonm+hhYQO1OMkl7X7p4jjBVBQ
=LOLL
-----END PGP SIGNATURE-----
