On 21:20 23 Jun 2003, Bret Hughes <[EMAIL PROTECTED]> wrote:
| > Yep. It's the UTF-8 stuff, which is inherently more expensive to parse.
| > Set your locale to "C" thus:
| > 	export LC_ALL=C
| > and try your tests again.
| >
| > There has been some progress with this. See here:
| > 	https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=82032
|
| That bugzilla thread does sound promising. I have held off trying to
| undo the UTF stuff at a system level, even though we are all English
| readers here, because I figured that other stuff will break in the
| future. Is that the case, or can I simply set LANG, LC_ALL et al. in
| /etc/sysconfig and be done with it?
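
If you want to try this before touching any system-wide setting, one option is to override the locale for a single command only. A minimal sketch, assuming a POSIX shell; in_c_locale is a made-up helper name for illustration, not something from the bug thread:

```shell
# Hypothetical helper: run one command with LC_ALL=C, leaving the
# calling shell's own locale settings untouched.
in_c_locale() {
    env LC_ALL=C "$@"
}

# The same grep twice, fed from stdin so the sketch needs no files.
# First in whatever locale the shell inherited (result depends on it):
printf 'alpha\nBeta\ngamma\n' | grep -c '^[a-z]'
# Then forced into the "C" locale, where a-z is strictly lowercase ASCII:
printf 'alpha\nBeta\ngamma\n' | in_c_locale grep -c '^[a-z]'
```

Running your real test through the helper (with "time" in front, say) should show whether the locale is what is costing you, without committing the whole box to "C".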

Well, I've set the C locale in our users' default environment with no
reported ill effects. I would think most "system" things should work that
way as well, since most of the need for something richer than the ASCII set
is at the user end, where users outside our little "English" niche can set
things as needed.

BTW, the locale stuff also affects things like "ls" listings and (ouch!)
shell globbing. No longer does [a-z]* do what you might expect, because
outside the "C" locale the collation sequence interleaves upper and lower
case, so the range a-z picks up uppercase letters too.

But I do see the "C" setting as something of a transition step. The default
encoding for XML and modern HTML is UTF-8. Hopefully parsers for these
things don't depend on the environment setting (they shouldn't). But as
this gets more widespread, you _are_ going to find yourself grepping
non-ASCII things more often, and we are going to have to take the hit that
UTF-8 costs in speed. As the bug thread shows, the pathological performance
you can see may fall out of poor code, but UTF-8 is never going to be as
fast as single-octet encodings, both because of the extra data and because
of the byte-sequence parsing that has to take place.

My motivation for using the "C" locale is twofold: so that my scripts don't
break (i.e. [a-z]* behaves as expected when the script was written, back in
the Dark Ages), and so that the many apps that aren't wide-char and locale
aware keep running; e.g. the threads on Acrobat needing the "C" locale. I
expect this legacy need to be gone in two years, hopefully a little less.

Frankly, I welcome UTF-8. No more stupid "what encoding is this text file"
bugs, etc.

Cheers,
-- 
Cameron Simpson <[EMAIL PROTECTED]> DoD#743
http://www.cskk.ezoshosting.com/cs/

A program in conformance will not tend to stay in conformance, because even if
it doesn't change, the standard will.   - Norman Diamond <[EMAIL PROTECTED]>

-- 
redhat-list mailing list
unsubscribe mailto:[EMAIL PROTECTED]
https://www.redhat.com/mailman/listinfo/redhat-list