Re: extremely slow text processing on Redhat 9

Cameron Simpson Mon, 23 Jun 2003 20:50:13 -0700

On 21:20 23 Jun 2003, Bret Hughes <[EMAIL PROTECTED]> wrote:
| > Yep. It's the UTF-8 stuff, which is inherently more expensive to parse.
| > Set your locale to "C" thus:
| >     export LC_ALL=C
| > and try your tests again.
| > 
| > There has been some progress with this. See here:
| >     https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=82032
| 
| That bugzilla thread does sound promising.  I have held off trying to
| undo the UTF stuff at a system level even though we are all English
| readers here because I figured that other stuff will break in the
| future.  Is that the case or can I simply set LANG, LC_ALL et al in
| etc/sysconfig and be done with it?


Well, I've set the C locale in our users' default environment with no
reported ill-effects. I would think most "system" things should work
that way as well, as most need for something more useful than the ASCII
set is at the user end, where users not in our little "English" niche
can set things as needed.
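
A minimal sketch of the per-user route, assuming Bourne-style login shells
that source a profile script. The file name in the comment is a hypothetical
choice, not something from this thread. The second half just checks that
LC_ALL wins over LANG and the individual LC_* variables, which is why
setting it alone is enough:

```shell
# Hypothetical location: /etc/profile.d/clocale.sh (the name is an assumption).
# Force the "C" locale for every shell that sources this file.
LC_ALL=C
export LC_ALL

# Ask the C library what it will actually use. Even with LANG and
# LC_COLLATE pointing at a UTF-8 locale, LC_ALL=C takes precedence.
effective_collate=$(LANG=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_ALL=C locale \
    | grep '^LC_COLLATE=' | tr -d '"')
echo "$effective_collate"
```

Users who want something other than "C" can override LC_ALL in their own
shell startup files after this has been sourced.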

BTW, the locale stuff also affects things like "ls" listings and
(ouch!) shell globbing. No longer does [a-z]* do what you might expect,
because outside the "C" locale the collation sequence interleaves the
uppercase letters with the lowercase ones, so the range can match
uppercase names too.
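
To see the "C" side of this for yourself, here is a small sketch. The file
names are made up for illustration, and it only demonstrates the "C" locale
behaviour, since what a UTF-8 locale does depends on the shell and libc
versions installed:

```shell
# Build a scratch directory with mixed-case names.
dir=$(mktemp -d)
touch "$dir/alpha" "$dir/Beta" "$dir/gamma"

# Start a fresh shell with LC_ALL=C so the glob is expanded under the
# C collation order, where [a-z] covers only the lowercase ASCII letters.
c_glob=$(cd "$dir" && LC_ALL=C sh -c 'echo [a-z]*')
echo "C glob: $c_glob"

# Same idea with sort: in the C locale every uppercase letter sorts
# before every lowercase one.
c_sort=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | tr '\n' ' ')
echo "C sort: $c_sort"

rm -r "$dir"
```

Rerunning the glob with LC_ALL set to something like en_US.UTF-8 (assuming
that locale is installed) may pull in "Beta" as well.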

But I can see that it should be something of a transition step.  The
encoding for XML and modern HTML is officially UTF-8.  Hopefully parsers
for these things don't depend on the environment setting (they shouldn't).

But as this gets more widespread, you _are_ going to find yourself
grepping non-ASCII things more often, and we are going to have to take
the speed hit that UTF-8 costs. As the bug thread shows, the pathological
performance you can see can fall out of poor code, but UTF-8 is never
going to be as fast as single-octet encodings, both because of the extra
data and because of the byte-sequence parsing that has to take place.
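
A rough illustration of the "extra data" point, plus a tiny helper for
repeating the timing tests under different locales. The helper name
count_under_locale and the sample word are my own inventions for this
sketch, not anything from the bug thread:

```shell
# The word "cafe" with an acute accent on the e, encoded two ways.
# printf octal escapes emit the raw bytes; wc -c counts bytes.
latin1_bytes=$(printf 'caf\351' | wc -c | tr -d ' ')    # ISO-8859-1: one byte for the accented e
utf8_bytes=$(printf 'caf\303\251' | wc -c | tr -d ' ')  # UTF-8: two bytes (C3 A9) for the same letter
echo "ISO-8859-1: $latin1_bytes bytes, UTF-8: $utf8_bytes bytes"

# Hypothetical helper: run the same grep under a chosen locale.
#   $1 = locale name, $2 = pattern, $3 = file
# It prints the number of matching lines.
count_under_locale() {
    LC_ALL=$1 grep -c "$2" "$3"
}
```

Timing the helper's grep once with "C" and once with a UTF-8 locale on the
same large file should show whatever gap your grep build has. The counts
should agree for pure ASCII input even though the run times may not.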

My main motivation for using the "C" locale is twofold: so my scripts
don't break (i.e. [a-z]* behaving as expected when the script was written,
back in the Dark Ages) and so that the many apps that aren't wide-char
and locale aware keep running, e.g. the threads on Acrobat needing "C"
locale etc. I expect this legacy need to be gone in two years, hopefully
a little less.

Frankly, I welcome UTF-8. No more stupid "what encoding is this text file"
bugs, etc.

Cheers,
--
Cameron Simpson <[EMAIL PROTECTED]> DoD#743
http://www.cskk.ezoshosting.com/cs/

A program in conformance will not tend to stay in conformance, because even if
it doesn't change, the standard will.   - Norman Diamond <[EMAIL PROTECTED]>

