> On Jan 24, 2018, at 11:14 AM, rhkra...@gmail.com wrote:
>
> This is OT, but I thought I'd start with this list as it is the list that I
> deal with more than any other. If no one here can help, suggestions for a
> better list to try will be appreciated.
>
I used to subscribe to Perl Beginners, but the administrator got draconian
about discussing other languages, so I dropped off, and now I appear to be
banned:

https://lists.perl.org/list/beginners.html

> I've never used Perl, but I'm hoping Perl can do the job for me.

Modern Perl (version 5.10 and up) is a UTF-8 compliant, general-purpose
language, but much of today's software is closed and baroque. You want to
use whatever language the database designers had in mind.

> What I need to do:
>
> I have multiple large files (one example is 5.4 MB). It is essentially a
> data dump from a database--I have no control over the database or the
> format of the dump.
>
> The file is ugly, with lots of extraneous characters--I want to run a
> series of regular expression search and replace commands over the file to
> clean it up.
>
> Some of the things that may make it tough:
>
> * In essence, there are no line breaks (0Ah) (or 0Dh)--in essence, there is
> one long 5.4 MB line (well, there are 4 line breaks for some short lines at
> the beginning of the file, maybe somewhere between 32 and 80 characters on
> each of those 4 lines).
>
> * The file can, and often will have UTF-8 characters in it (iiuc--the file
> contains URLs, some of which, I'm sure, can include UTF-8 characters, or
> maybe some other encoding??). The search and replace doesn't particularly
> have to handle the UTF-8 search terms (because the keywords and punctuation
> I will search on will be plain ASCII), but any UTF-8 characters have to
> remain "intact" after the search and replace.
>
> I'm hoping that I can write a Perl script that may be something like this:
>
> Code to open a file (which I will need to learn / find)
>
> Multiple statements of the form "s/<search regular expression>/<replace
> regular expression>/g"
>
> (Aside, the replace probably doesn't have to be a regular expression, it will
> need to include things like line break characters (\n).)
>
> I did try to do this with one of the editors I use (I started with Kate),
> but Kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes /
> characters (at inconvenient places), and, although I got the job (almost)
> done, it required a lot of manual intervention / correction, so I want to
> automate it with a tool that can work on very long lines without inserting
> line breaks (other than those I require).
>
> If some simpler tool can do the job, I'll consider that as well (I have
> occasionally used awk, and maybe sed (I don't think sed ever proved useful
> for me)).
>
> Any help appreciated.
>

If you attack the files with raw Perl, you're going to be writing a lexer
and parser to read the database dump into a data structure, and then doing
your work against that (perhaps by dumping it to a common format and then
writing tools against that). If you don't have an EBNF grammar for the
dump, you'll have to figure it out. Getting the lexer/parser right, and
verifying that you got it right, is going to be a *lot* of work.

Your best bet is to:

1. Have the database administrator generate exports in a friendlier format,
such as flat-file comma-separated values, tab-separated values, XML, etc.

2. Get a tool that understands the dump file (such as the original database
engine), import the dumps, and then generate queries, reports, etc. as
desired to meet your needs.

David
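
P.S. If the dump turns out to be regular enough that plain search and
replace really is sufficient, the script described in the question can be
sketched roughly as below. This is Python rather than Perl, and the three
patterns in RULES are made-up placeholders, not anything taken from the
actual dump; the names RULES, clean, and clean_file are hypothetical too.
The sketch reads the whole file as bytes, so the one long 5.4 MB line is
not a problem and nothing gets re-wrapped. Every byte of a multi-byte
UTF-8 sequence is 0x80 or above, so ASCII-only patterns cannot match
inside one, and those bytes pass through untouched.

```python
import re

# Hypothetical cleanup rules: (pattern, replacement) pairs, applied in order.
# The patterns are placeholders; the real ones depend on the dump format.
RULES = [
    (re.compile(rb"\x00+"), b""),        # drop runs of NUL padding
    (re.compile(rb"\|\|"), b"\n"),       # turn a "||" separator into a line break
    (re.compile(rb"[ \t]+\n"), b"\n"),   # trim blanks left before a line break
]


def clean(data: bytes) -> bytes:
    """Apply every rule in RULES to the raw bytes and return the result."""
    for pattern, replacement in RULES:
        data = pattern.sub(replacement, data)
    return data


def clean_file(src: str, dst: str) -> None:
    """Read src in one go, clean it, and write the result to dst."""
    with open(src, "rb") as f:   # binary mode: no decoding, no newline translation
        data = f.read()
    with open(dst, "wb") as f:
        f.write(clean(data))
```

Something like clean_file("dump.txt", "dump.clean.txt") would then process
one file (both file names are made up). A few megabytes fit comfortably in
memory, so reading the file in one piece should be fine at this size.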