AW: Working with files of different character encodings

Thomas Bätzler Tue, 06 Apr 2010 01:43:49 -0700

Doug Cacialli <[email protected]> asked:
> Does anyone have any ideas how I can make the second block of code
> work?  Or otherwise accomplish the task without opening the .txt file
> twice?


How large are your data files? If your available memory is much larger than 
your maximum file size, you might get away with slurping the file into a scalar 
and then converting its encoding if needed, possibly like this:

#!/usr/bin/perl -w

use strict;
use Encode;

my $file = 'test.txt';

# Read raw bytes so the BOM and NUL checks below see the file unchanged.
open( my $fh, '<:raw', $file ) or die "Can't open '$file': $!";

my $data = do {
  local $/ = undef;
  <$fh>;
};

close( $fh );

# A BOM tells us the byte order; without one, guess from where the NUL
# byte of the first (presumably ASCII-range) character sits.
if( $data =~ m/^\xff\xfe/ || $data =~ m/^\xfe\xff/ ){
  print "input is UTF-16 w/ BOM\n";
  $data = decode('UTF-16',$data);
} elsif( $data =~ m/^[^\x00]\x00/ ){
  print "input is probably little-endian utf-16 w/o BOM\n";
  $data = decode('UTF-16LE',$data);
} elsif( $data =~ m/^\x00[^\x00]/ ){
  print "input is probably big-endian utf-16 w/o BOM\n";
  $data = decode('UTF-16BE',$data);
}

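# split ' ' skips leading whitespace and splits on runs of whitespace,
# so no chomp is needed and a leading blank does not add an empty word.
my @words = split ' ', $data;

print "input file has " . scalar( @words ) . " words\n";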

__END__
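
If the files turn out to be too large to slurp comfortably, the same guess can be made from the first two bytes alone, and the handle can then be switched to a decoding layer, so the file is still opened only once. The following is a minimal sketch along those lines, not a drop-in solution: the sub names open_guessing_utf16 and count_words are made up for this example, and the no-BOM heuristic assumes the first character is in the ASCII range, as in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): open $path once,
# peek at the first two bytes, and return a handle that yields decoded
# characters if the file looks like UTF-16, or raw bytes otherwise.
sub open_guessing_utf16 {
    my ($path) = @_;

    open( my $fh, '<', $path ) or die "Can't open '$path': $!";
    binmode( $fh );    # raw bytes while we inspect the start of the file

    my $got = read( $fh, my $head, 2 );
    die "Can't read '$path': $!" unless defined $got;

    # $skip is the number of BOM bytes to jump over before decoding.
    my ( $enc, $skip ) = ( undef, 0 );
    if    ( $head eq "\xff\xfe" )       { ( $enc, $skip ) = ( 'UTF-16LE', 2 ) }
    elsif ( $head eq "\xfe\xff" )       { ( $enc, $skip ) = ( 'UTF-16BE', 2 ) }
    elsif ( $head =~ /^[^\x00]\x00\z/ ) { $enc = 'UTF-16LE' }    # guess, no BOM
    elsif ( $head =~ /^\x00[^\x00]\z/ ) { $enc = 'UTF-16BE' }    # guess, no BOM

    # Rewind (past the BOM, if any) on the same handle, then start decoding.
    seek( $fh, $skip, 0 ) or die "Can't seek in '$path': $!";
    binmode( $fh, ":encoding($enc)" ) if defined $enc;

    return $fh;
}

# Count whitespace-separated words line by line, so memory use stays
# bounded by the longest line rather than by the file size.
sub count_words {
    my ($path) = @_;

    my $fh    = open_guessing_utf16($path);
    my $count = 0;
    while ( my $line = <$fh> ) {
        my @words = split ' ', $line;
        $count += @words;
    }
    close( $fh );

    return $count;
}
```

It would be called as count_words('test.txt'). Files that do not look like UTF-16 are read as plain bytes, which matches what the slurping script does in that case.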

HTH,
Thomas

