-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Jan 24, 2018 at 02:14:42PM -0500, rhkra...@gmail.com wrote:
> This is OT, but I thought I'd start with this list as it is the list that I
> deal with more than any other. If no one here can help, suggestions for a
> better list to try will be appreciated.
>
> I've never used Perl, but I'm hoping Perl can do the job for me.
>
> What I need to do:
>
> I have multiple large files (one example is 5.4 MB). It is essentially a data
> dump from a database--I have no control over the database or the format of the
> dump.

Perl won't have a problem with a 5.4MB long line.

[...]

> * The file can, and often will have UTF-8 characters in it (iiuc--the file
> contains URLs, some of which, I'm sure, can include UTF-8 characters, or maybe
> some other encoding??). The search and replace doesn't particularly have to
> handle the UTF-8 search terms (because the keywords and punctuation I will
> search on will be plain ASCII), but any UTF-8 characters have to remain
> "intact" after the search and replace.

Now that's one thing: does the file just contain some UTF-8 characters,
or is it valid UTF-8? This is important to know, because then you can
decide whether to treat it as UTF-8 (then regexps will be OK) or
as a byte stream (then you'll "see" each multi-byte UTF-8 sequence as
a run of separate bytes: there be dragons).

You can check that with

  iconv -f UTF-8 < your_file > /dev/null

or something similar (iconv complains and exits with a non-zero status
if the input isn't valid UTF-8).

> I'm hoping that I can write a Perl script that may be something like this:
>
> Code to open a file (which I will need to learn / find)

  open(my $fh, "<:encoding(UTF-8)", "your_file") or die "your_file: $!";

(the whole kaboodle in "perldoc -f open").

> Multiple statements of the form "s/<search regular expression>/<replace
> regular expression/g

If you set $/ (the input record separator) to undef, you can slurp the
whole file into one variable, like so:

  $/ = undef;
  my $data = <$fh>;

(that narrative is in 'man perlvar', for the special variables).

> (Aside, the replace probably doesn't have to be a regular expression, it will
> need to include things like line break characters (\n).)

The replace string isn't a regexp anyway (doesn't make sense :) -- it's
just a normal string, possibly with placeholders for parenthesized
submatches from the regexp (if that's mumbo jumbo for you, just ask).
"\n" is just a normal character, as is "\t", etc.

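To make the placeholder business a bit more concrete, here is a minimal
sketch. Everything specific in it is an assumption for illustration: the
sample data, the "id=<digits>;" field and the helper name split_records are
invented, since I don't know what your dump really looks like.

```perl
use strict;
use warnings;

# Hypothetical helper: start a new line before every "id=<digits>;" field.
# The field name and the output wording are invented for illustration.
sub split_records {
    my ($data) = @_;
    # (\d+) captures the digits; $1 puts them back into the replacement.
    # "\n" in the replacement is an ordinary newline character.
    $data =~ s/id=(\d+);/\nrecord $1: /g;
    return $data;
}

# Made-up sample standing in for the one-line dump.
my $dump = 'id=1;url=http://example.org/a;id=2;url=http://example.org/b;';
print split_records($dump), "\n";
```

The parentheses in the search part remember what they matched, and $1 in
the replace part puts it back. A second pair of parentheses would be $2,
and so on.
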
Since the whole ugly string will contain newlines, don't forget
the /s modifier, which lets "." in the regexp match newlines too
(by default "." matches any character except a newline), like so:

  $data =~ s/tom/jerry/gs;

(the whole story is in 'man perlre').

At the end, you just print that:

  open(my $outfh, ">:encoding(UTF-8)", "your_output_file") or die "your_output_file: $!";
  print $outfh $data;

(no comma between the filehandle $outfh and $data)

> I did try to do this with one of the editors I use (I started with Kate), but
> kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes /
> characters (at inconvenient places) [...]

Yikes. Neither vim nor Emacs will do that to you (although Emacs gets
a bit sluggish on MB-long lines). I'd put such an editor in the recycle
bin (sorry).

[...]

> If some simpler tool can do the job, I'll consider that as well (I have
> occasionally used awk, and maybe sed (I don't think sed ever proved useful for
> me).

Sed is actually pretty nifty, but takes some getting used to.

> Any help appreciated.

I hope that gets you started. Just ask.

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlpo6r8ACgkQBcgs9XrR2kZ/oQCfdeDP0dugi4wFQZmjPc9FhIgz
ltEAn1Wonm+hhYQO1OMkl7X7p4jjBVBQ
=LOLL
-----END PGP SIGNATURE-----