Geir,

I looked through some scripts that I wrote to help me sync the GNU Nano repository and I came across a Perl script that might be useful to you in quickly identifying all log messages that are not representable in ASCII (hence possibly not UTF-8).
Attached is the source of the script. To use it, you will need the libsvn Perl bindings (on Debian, install the `libsvn-perl` package), and you will need to edit line 20 to change the URL of the Subversion repository that you wish to examine.

Example output for svn://svn.sv.gnu.org/nano is:

------------------------------------------------------------------------
r619
Added Galician translation by Jacobo Tarr<jtar...@trasno.net>.
------------------------------------------------------------------------
r757
Updated Galician translation; thanks, Jacobo Tarr
------------------------------------------------------------------------
r826
Galician translation brought up to date for 1.1.2 by Jacobo Tarr
------------------------------------------------------------------------
r954
Galician translation update (Jacobo Tarr.
------------------------------------------------------------------------
r958
French translation update (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r962
French translation update (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r1009
Moved no.po to nn.po. New Norwegian bokm欠translation, by Stig E Sandoe <s...@ii.uib.no>. Updated Norwegian nynorsk translation, by Kjetil Torgrim Homme <kjeti...@linpro.no>.
------------------------------------------------------------------------
r1013
Moved no.po to nn.po. New Norwegian bokm欠translation, by Stig E sand𠼳...@users.sourceforge.net>. Added missing entries to THANKS.
------------------------------------------------------------------------
r1047
French translation updates (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r1070
Norwegian bokm欠translation updates (Stig E Sandoe).
------------------------------------------------------------------------
r1071
Norwegian bokm欠translation updates (Stig E Sand𩮍
------------------------------------------------------------------------
r1072
Norwegian bokm欠translation updates (Stig E Sand𩮍
------------------------------------------------------------------------
r1125
French translation updates (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r1133
French translation updates (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r1258
French translation update (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r1259
Spanish translation updates (Ricardo Javier Cⳤenes Medina).
------------------------------------------------------------------------
r1299
Updated Spanish translation (Ricardo Javier Cⳤenes Medina).
------------------------------------------------------------------------
r1301
Updated French translation (Jean-Philippe Gu곡rd).
------------------------------------------------------------------------
r1500
Updated French translation by Jean-Philippe Gu곡rd.
------------------------------------------------------------------------
r1537
Updated French translation by Jean-Philippe Gu곡rd.
------------------------------------------------------------------------
r1923
Updated French translation by Jean-Philippe Guérard.
------------------------------------------------------------------------
r2102
spell Ulf H峮hammar's name right
------------------------------------------------------------------------
r2373
in do_credits(), display Florian König's name properly in UTF-8 mode; since we can't dynamically set that element of the array to its UTF-8 equivalent when in UTF-8 mode, we have to use the ISO-8859-1 version and pass every string in the credits through make_mbstring() to make sure they're all UTF-8 (sigh)
------------------------------------------------------------------------
r2784
rework the credits handling to display Florian König's name properly whether we're in a UTF-8 locale or not. This requires a minor hack, but it's better than requiring a massive function that we only use once
------------------------------------------------------------------------
r2898
Update French manpages by Jean-Philippe Guérard.
------------------------------------------------------------------------
r3924
Update French manpages by Jean-Philippe Guérard.
------------------------------------------------------------------------
r4181
per Jean-Philippe Guérard's updates, in doc/man/fr/*.1, doc/man/fr/nanorc.5, fix copyright notices; the copyrights are disclaimed on these translations, but the copyrights of the untranslated works also apply
------------------------------------------------------------------------
r4182
per Jean-Philippe Guérard's updates, in doc/man/fr/*.1, doc/man/fr/nanorc.5, fix copyright notices; the copyrights are disclaimed on these translations, but the copyrights of the untranslated works also apply
------------------------------------------------------------------------
r4208
in print_opt_full(), use strlenpt() instead of strlen(), so that tabs are placed properly when displaying translated strings in UTF-8, as found by Jean-Philippe Guérard
------------------------------------------------------------------------

The corrupted-looking entries are the ones where the log message is incorrectly stored in ISO-8859-1.
#! /usr/bin/env perl

use strict;
use warnings;

use Encode qw( from_to );
use SVN::Ra;

sub is_ascii {
  my @chars = split(//, shift);
  for my $c (@chars) {
    if (ord($c) >= 128) {
      return 0;
    }
  }

  1;
}

my $ra = SVN::Ra->new("svn://svn.sv.gnu.org/nano");
$ra->get_log('', 1, $ra->get_latest_revnum, 0, 1, 0, sub {
  my ($paths, $rev_num, $user, $datetime, $log_msg) = @_;

  if (not is_ascii($log_msg)) {
    print "------------------------------------------------------------------------\n";
    print "r", $rev_num, "\n";
    print $log_msg, "\n";
  }
});

print "------------------------------------------------------------------------\n";
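The script above only flags messages that are not pure ASCII; it does not say which of them are already valid UTF-8. As a possible follow-up, here is a minimal sketch that separates the two cases with the Encode module. It is not part of the attached script: the helper names `is_valid_utf8`, `classify_log_msg` and `latin1_to_utf8` are made up for illustration, and the conversion assumes that every message that fails to decode as UTF-8 is really ISO-8859-1, which should be checked by eye before rewriting any log messages.

```perl
#! /usr/bin/env perl

use strict;
use warnings;

use Encode qw( decode from_to );

# True if the byte string decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
  my ($bytes) = @_;
  my $copy = $bytes;  # decode() with a CHECK value may modify its argument
  return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical helper: label a raw log message "ascii", "utf-8" or "other".
sub classify_log_msg {
  my ($bytes) = @_;
  return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;
  return is_valid_utf8($bytes) ? 'utf-8' : 'other';
}

# Hypothetical helper: ASSUMES the input is ISO-8859-1 and returns a
# UTF-8 encoded copy. from_to() converts its first argument in place.
sub latin1_to_utf8 {
  my ($bytes) = @_;
  my $copy = $bytes;
  from_to($copy, 'iso-8859-1', 'utf-8');
  return $copy;
}

# Small demonstration with hand-written byte strings.
my $latin1 = "Jean-Philippe Gu\xE9rard";      # e-acute as one ISO-8859-1 byte
my $utf8   = "Jean-Philippe Gu\xC3\xA9rard";  # e-acute as two UTF-8 bytes

print classify_log_msg("plain ASCII"), "\n";
print classify_log_msg($utf8), "\n";
print classify_log_msg($latin1), "\n";
```

In the `get_log` callback, `classify_log_msg($log_msg)` could replace the `is_ascii` test so that only the "other" messages are printed; those would be the candidates for conversion.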