id3v2 doesn't handle unicode in text fields correctly. Or rather, the
problem is really that it doesn't do charset conversion to the tty
according to the locale (the charset in $LC_CTYPE). As a console
application it should use that charset for all input and output. The
attached patch adds the necessary charset conversions, and read/writes
tag text as utf-16 whenever there are chars outside iso-8859-1.

Patch notes:

1. It assumes iconv and a number of other gnu libc utilities. I've made
no attempt to be portable outside libc. For Debian that should be fine,
but it could be a problem if the patch goes upstream.

2. It assumes utf-16 support in libid3, which is present at least in the
Ubuntu packaging 3.8.3-7.2ubuntu4. I believe it came about as a
non-maintainer upload in 3.8.3-6.

3. It could introduce a compatibility problem, since earlier one could
read and write tags in iso-8859-1 regardless of the $LC_CTYPE in use.
There might be scripts using id3v2 that assumes that, without ensuring
that the locale environment is correct.

4. I also noticed that the writing of id3v1 tags, which doesn't support
utf-16 of course, do not do a sensible conversion from utf-16 to
iso-8859-1. That is a bug in libid3, but it's also effectively a compat
problem since earlier id3v2 could read and write full iso-8859-1 to
id3v1 tags as well.

5. It handles invalid id3v2 fields differently and sometimes worse. E.g.
I noticed a tag where some text fields had $03 as encoding flag (i.e.
utf-8 in id3v2.4) followed by a utf-16 BOM and then byte swapped
iso-8859-1. Without this patch id3v2 actually manages to show this mess
in a sensible way.

If such malformed tags are common then maybe better handling of them is
required. I think this tag was written by the id3mux gstreamer plugin,
which makes it kind of worrisome.

6. There's a bug in libid3 where ID3_Field.Size() returns the number of
bytes rather than the number of characters, even though the docs makes a
big number of it doing the opposite. Unfortunately I couldn't find any
way around that, so the patch depends on the continued presence of that
bug. :(

7. I've not tested it much more than with the main fields
artist/album/title.

Index: id3v2-0.1.12/charset.cpp
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ id3v2-0.1.12/charset.cpp	2011-05-01 16:40:04.491454657 +0200
@@ -0,0 +1,129 @@
+#include "charset.h"
+
+#include <errno.h>
+#include <iconv.h>
+#include <langinfo.h>
+#include <locale.h>
+#include <stdlib.h>
+#include <id3/misc_support.h>
+#include <id3/utils.h>
+
+char *GetID3EncText (const ID3_Frame *frame, ID3_FieldID fldName)
+{
+  return GetID3EncText (frame, fldName, (size_t) -1);
+}
+
+char *GetID3EncText (const ID3_Frame *frame, ID3_FieldID fldName, size_t idx)
+{
+  ID3_Field *fld = frame->GetField (fldName);
+  ID3_TextEnc enc = fld->GetEncoding();
+  size_t len = fld->Size(), inl;
+  char *enc_text;
+  const char *iconv_enc;
+
+  if (ID3TE_IS_DOUBLE_BYTE_ENC (enc)) {
+    enc_text = new char [(len + 1) * sizeof (unicode_t)];
+    if (idx == (size_t) -1)
+      len = fld->Get ((unicode_t *) enc_text, len);
+    else
+      len = fld->Get ((unicode_t *) enc_text, len, idx);
+    iconv_enc = "UTF-16BE";
+    // Seems that fld->Size() returns number of bytes afterall,
+    // contrary to docs. Can't find any way to avoid it, so we have to
+    // count on that it remains bug compatible. :(
+    len >>= 1;
+    inl = len * sizeof (unicode_t);
+  }
+  else {
+    enc_text = new char [len + 1];
+    if (idx == (size_t) -1)
+      len = fld->Get (enc_text, len);
+    else
+      len = fld->Get (enc_text, len, idx);
+    iconv_enc = "ISO-8859-1";
+    inl = len;
+  }
+
+  iconv_t cd = iconv_open (nl_langinfo (CODESET), iconv_enc);
+  if (cd != (iconv_t) -1) {
+    size_t outl = (len + 1) * MB_CUR_MAX;
+    char *mbs_text = new char [outl];
+    char *inp = enc_text, *outp = mbs_text;
+    while (1) {
+      if (iconv (cd, &inp, &inl, &outp, &outl) != (size_t) -1) {
+	*outp++ = 0;
+	if (outl > len) {
+	  char *realloced = new char [outp - mbs_text];
+	  memcpy ((void *) realloced, (void *) mbs_text, outp - mbs_text);
+	  delete[] mbs_text;
+	  mbs_text = realloced;
+	}
+	iconv_close (cd);
+	delete[] enc_text;
+	return mbs_text;
+      }
+      if (errno == EILSEQ) {
+	inp++;
+	if (ID3TE_IS_DOUBLE_BYTE_ENC (enc)) inp++;
+	inl--;
+	*outp++ = '?';
+	outl--;
+      }
+      else break;
+    }
+    iconv_close (cd);
+  }
+
+  delete[] enc_text;
+
+  // Lame fallback but no worse than earlier versions.
+  return ID3_GetString (frame, fldName);
+}
+
+void SetID3EncText (ID3_Frame *frame, ID3_FieldID fldName, char *text)
+{
+  // Try iso-8859-1 first.
+  iconv_t cd = iconv_open ("ISO-8859-1", nl_langinfo (CODESET));
+  if (cd != (iconv_t) -1) {
+    size_t inl = strlen (text);
+    size_t latin1_len = inl + 1;
+    char *latin1_text = new char [latin1_len];
+    char *inp = text, *outp = latin1_text;
+    if (iconv (cd, &inp, &inl, &outp, &latin1_len) != (size_t) -1) {
+      *outp = 0;
+      frame->Field (ID3FN_TEXTENC) = ID3TE_ASCII;
+      frame->Field (fldName) = latin1_text;
+      delete[] latin1_text;
+      iconv_close (cd);
+      return;
+    }
+    delete[] latin1_text;
+    iconv_close (cd);
+  }
+
+  // Then utf-16.
+  cd = iconv_open ("UTF-16BE", nl_langinfo (CODESET));
+  if (cd != (iconv_t) -1) {
+    size_t inl = strlen (text);
+    size_t utf16_len = inl + 1;
+    unicode_t *utf16_text = new unicode_t [utf16_len];
+    char *inp = text, *outp = (char *) utf16_text;
+    utf16_len *= sizeof (unicode_t);
+    if (iconv (cd, &inp, &inl, &outp, &utf16_len) != (size_t) -1) {
+      *outp = 0;
+      *(outp + 1) = 0;
+      frame->Field (ID3FN_TEXTENC) = ID3TE_UTF16;
+      frame->Field (fldName).SetEncoding (ID3TE_UTF16);
+      frame->Field (fldName) = utf16_text;
+      delete[] utf16_text;
+      iconv_close (cd);
+      return;
+    }
+    delete[] utf16_text;
+    iconv_close (cd);
+  }
+
+  // Lame fallback but no worse than earlier versions.
+  frame->Field (ID3FN_TEXTENC) = ID3TE_ASCII;
+  frame->Field (ID3FN_TEXT) = text;
+}
Index: id3v2-0.1.12/charset.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ id3v2-0.1.12/charset.h	2011-05-01 16:40:04.491454657 +0200
@@ -0,0 +1,10 @@
+#ifndef __CHARSET_H__
+#define __CHARSET_H__
+
+#include <id3/tag.h>
+
+char *GetID3EncText (const ID3_Frame *frame, ID3_FieldID fldName);
+char *GetID3EncText (const ID3_Frame *frame, ID3_FieldID fldName, size_t idx);
+void SetID3EncText (ID3_Frame *frame, ID3_FieldID fldName, char *text);
+
+#endif
Index: id3v2-0.1.12/id3v2.cpp
===================================================================
--- id3v2-0.1.12.orig/id3v2.cpp	2011-05-01 16:39:39.471454647 +0200
+++ id3v2-0.1.12/id3v2.cpp	2011-05-01 16:44:39.841454784 +0200
@@ -17,6 +17,7 @@
 #include <sys/stat.h>
 
 #include "genre.h"
+#include "charset.h"
 
 #define MAXNOFRAMES 1000
 
@@ -99,6 +100,8 @@
   
   int frameCounter = 0;
   
+  setlocale (LC_ALL, "");
+
   while (true)
   {
     int option_index = 0;
@@ -419,7 +422,7 @@
             delete todel;
           }
           if (strlen(frameList[ii].data) > 0) {
-            myFrame->Field(ID3FN_TEXT) = frameList[ii].data;
+            SetID3EncText(myFrame, ID3FN_TEXT, frameList[ii].data);
             myTag.AttachFrame(myFrame);
           }
           break;
@@ -432,17 +435,16 @@
             delete todel;
           }
           if (strlen(frameList[ii].data) > 0) {
-            myFrame->Field(ID3FN_TEXTENC) = ID3TE_ASCII;
             char *text;
             text = strchr(frameList[ii].data, ':');
             if (text == NULL) 
             {
-              myFrame->Field(ID3FN_TEXT) = frameList[ii].data;
+              SetID3EncText(myFrame, ID3FN_TEXT, frameList[ii].data);
             } else {
               *text = '\0';
               text++;
-              myFrame->Field(ID3FN_LANGUAGE) = frameList[ii].data;
-              myFrame->Field(ID3FN_TEXT) = text;
+              SetID3EncText(myFrame, ID3FN_LANGUAGE, frameList[ii].data);
+              SetID3EncText(myFrame, ID3FN_TEXT, text);
             }
             char * test = ID3_GetString(myFrame, ID3FN_TEXT);
             if (strlen(test) > 0) {
@@ -461,7 +463,7 @@
 
           if (pFrame != NULL) 
           {
-            currentTrackNum = ID3_GetString(pFrame, ID3FN_TEXT);
+            currentTrackNum = GetID3EncText(pFrame, ID3FN_TEXT);
             if (*currentTrackNum == '/') 
             {
               newTrackNum = (char *)malloc(strlen(currentTrackNum) 
@@ -476,7 +478,7 @@
             }
           }
           
-          myFrame->Field(ID3FN_TEXT) = frameList[ii].data;
+          SetID3EncText(myFrame, ID3FN_TEXT, frameList[ii].data);
           myTag.AttachFrame(myFrame);
 
           free(newTrackNum);
@@ -496,12 +498,12 @@
           text = strchr(frameList[ii].data, ':');
           if (text == NULL) 
           {
-            myFrame->Field(ID3FN_TEXT) = frameList[ii].data;
+            SetID3EncText(myFrame, ID3FN_TEXT, frameList[ii].data);
           } else {
             *text = '\0';
             text++;
-            myFrame->Field(ID3FN_DESCRIPTION) = frameList[ii].data;
-            myFrame->Field(ID3FN_TEXT) = text;
+            SetID3EncText(myFrame, ID3FN_DESCRIPTION, frameList[ii].data);
+            SetID3EncText(myFrame, ID3FN_TEXT, text);
           }
           if (strlen(ID3_GetString(myFrame, ID3FN_TEXT)) > 0) {
             myTag.AttachFrame(myFrame);
@@ -518,7 +520,7 @@
           text = strchr(frameList[ii].data, ':');
           if (text == NULL) 
           {
-            myFrame->Field(ID3FN_TEXT) = frameList[ii].data;
+            SetID3EncText(myFrame, ID3FN_TEXT, frameList[ii].data);
           } else {
          	*text = '\0';
           	text++;
@@ -526,13 +528,13 @@
           	lang = strchr(text, ':');
           	if (lang == NULL) 
           	{
-          	  myFrame->Field(ID3FN_DESCRIPTION) = frameList[ii].data;
-          	  myFrame->Field(ID3FN_TEXT) = text;
+                  SetID3EncText(myFrame, ID3FN_DESCRIPTION, frameList[ii].data);
+                  SetID3EncText(myFrame, ID3FN_TEXT, text);
           	} else {
           	  *lang = '\0';
           	  lang++;
-          	  myFrame->Field(ID3FN_DESCRIPTION) = frameList[ii].data;
-              myFrame->Field(ID3FN_TEXT) = text;
+                  SetID3EncText(myFrame, ID3FN_DESCRIPTION, frameList[ii].data);
+		  SetID3EncText(myFrame, ID3FN_TEXT, text);
               myFrame->Field(ID3FN_LANGUAGE) = lang;
             }
           }
@@ -555,10 +557,10 @@
               pFirstFrame = pFrame;
             }
 
-            char *tmp_desc = ID3_GetString(pFrame, ID3FN_DESCRIPTION);
-            char *tmp_my_desc = ID3_GetString(myFrame, ID3FN_DESCRIPTION);
-            char *tmp_lang = ID3_GetString(pFrame, ID3FN_LANGUAGE);
-            char *tmp_my_lang = ID3_GetString(myFrame, ID3FN_LANGUAGE);
+            char *tmp_desc = GetID3EncText(pFrame, ID3FN_DESCRIPTION);
+            char *tmp_my_desc = GetID3EncText(myFrame, ID3FN_DESCRIPTION);
+            char *tmp_lang = GetID3EncText(pFrame, ID3FN_LANGUAGE);
+            char *tmp_my_lang = GetID3EncText(myFrame, ID3FN_LANGUAGE);
             if ((strcmp(tmp_desc, tmp_my_desc) == 0) &&
                 (strcmp(tmp_lang, tmp_my_lang) == 0)) 
             {
@@ -615,7 +617,7 @@
         {
           char 
             *sURL = ID3_GetString(myFrame, ID3FN_URL),
-            *sDesc = ID3_GetString(myFrame, ID3FN_DESCRIPTION);
+            *sDesc = GetID3EncText(myFrame, ID3FN_DESCRIPTION);
           std::cout << "(" << sDesc << "): " << sURL << std::endl;
           delete [] sURL;
           delete [] sDesc;
@@ -627,7 +629,7 @@
           size_t nItems = myFrame->Field(ID3FN_TEXT).GetNumTextItems();
           for (size_t nIndex = 1; nIndex <= nItems; nIndex++)
           {
-            char *sPeople = ID3_GetString(myFrame, ID3FN_TEXT, nIndex);
+            char *sPeople = GetID3EncText(myFrame, ID3FN_TEXT, nIndex);
             std::cout << sPeople;
             delete [] sPeople;
             if (nIndex < nItems)
@@ -642,7 +644,7 @@
         {
           char
             *sMimeType = ID3_GetString(myFrame, ID3FN_MIMETYPE),
-            *sDesc     = ID3_GetString(myFrame, ID3FN_DESCRIPTION),
+            *sDesc     = GetID3EncText(myFrame, ID3FN_DESCRIPTION),
             *sFormat   = ID3_GetString(myFrame, ID3FN_IMAGEFORMAT);
           size_t
             nPicType   = myFrame->Field(ID3FN_PICTURETYPE).Get(),
@@ -658,9 +660,9 @@
         case ID3FID_GENERALOBJECT:
         {
           char 
-            *sMimeType = ID3_GetString(myFrame, ID3FN_TEXT), 
-            *sDesc = ID3_GetString(myFrame, ID3FN_DESCRIPTION), 
-            *sFileName = ID3_GetString(myFrame, ID3FN_FILENAME);
+            *sMimeType = ID3_GetString(myFrame, ID3FN_TEXT),
+            *sDesc = GetID3EncText(myFrame, ID3FN_DESCRIPTION), 
+            *sFileName = GetID3EncText(myFrame, ID3FN_FILENAME);
           size_t 
           nDataSize = myFrame->Field(ID3FN_DATA).Size();
           std::cout << "(" << sDesc << ")[" 
Index: id3v2-0.1.12/list.cpp
===================================================================
--- id3v2-0.1.12.orig/list.cpp	2011-05-01 16:39:39.501454647 +0200
+++ id3v2-0.1.12/list.cpp	2011-05-01 16:47:12.811454856 +0200
@@ -25,6 +25,7 @@
 #include <id3/misc_support.h>
 #include "frametable.h"
 #include "genre.h"
+#include "charset.h"
 
 const char *GetDescription(const ID3_FrameID eFrameID)
 {
@@ -119,7 +120,7 @@
         case ID3FID_ENCODERSETTINGS:
         case ID3FID_YEAR:
         {
-          char *sText = ID3_GetString(myFrame, ID3FN_TEXT);
+          char *sText = GetID3EncText(myFrame, ID3FN_TEXT);
           std::cout << sText << std::endl;
           delete [] sText;
           break;
@@ -128,7 +129,7 @@
         {
           const char* genre_str;
           int genre_id = 255;
-          char *sText = ID3_GetString(myFrame, ID3FN_TEXT);
+          char *sText = GetID3EncText(myFrame, ID3FN_TEXT);
           sscanf(sText, "(%d)", &genre_id);
           if (genre_id == 255) {
             genre_str = sText;
@@ -143,8 +144,8 @@
         case ID3FID_USERTEXT:
         {
           char 
-            *sText = ID3_GetString(myFrame, ID3FN_TEXT), 
-            *sDesc = ID3_GetString(myFrame, ID3FN_DESCRIPTION);
+            *sText = GetID3EncText(myFrame, ID3FN_TEXT), 
+            *sDesc = GetID3EncText(myFrame, ID3FN_DESCRIPTION);
           std::cout << "(" << sDesc << "): " << sText << std::endl;
           delete [] sText;
           delete [] sDesc;
@@ -153,8 +154,8 @@
         case ID3FID_TERMSOFUSE:
         {
           char 
-            *sText = ID3_GetString(myFrame, ID3FN_TEXT), 
-            *sLang = ID3_GetString(myFrame, ID3FN_LANGUAGE);
+            *sText = GetID3EncText(myFrame, ID3FN_TEXT), 
+            *sLang = GetID3EncText(myFrame, ID3FN_LANGUAGE);
           std::cout << "[" << sLang << "]: "
                << sText << std::endl;
           delete [] sText;
@@ -165,9 +166,9 @@
         case ID3FID_UNSYNCEDLYRICS:
         {
           char 
-            *sText = ID3_GetString(myFrame, ID3FN_TEXT), 
-            *sDesc = ID3_GetString(myFrame, ID3FN_DESCRIPTION), 
-            *sLang = ID3_GetString(myFrame, ID3FN_LANGUAGE);
+            *sText = GetID3EncText(myFrame, ID3FN_TEXT), 
+            *sDesc = GetID3EncText(myFrame, ID3FN_DESCRIPTION), 
+            *sLang = GetID3EncText(myFrame, ID3FN_LANGUAGE);
           std::cout << "(" << sDesc << ")[" << sLang << "]: "
                << sText << std::endl;
           delete [] sText;
@@ -193,7 +194,7 @@
         {
           char 
             *sURL = ID3_GetString(myFrame, ID3FN_URL),
-            *sDesc = ID3_GetString(myFrame, ID3FN_DESCRIPTION);
+            *sDesc = GetID3EncText(myFrame, ID3FN_DESCRIPTION);
           std::cout << "(" << sDesc << "): " << sURL << std::endl;
           delete [] sURL;
           delete [] sDesc;
@@ -205,7 +206,7 @@
           size_t nItems = myFrame->Field(ID3FN_TEXT).GetNumTextItems();
           for (size_t nIndex = 1; nIndex <= nItems; nIndex++)
           {
-            char *sPeople = ID3_GetString(myFrame, ID3FN_TEXT, nIndex);
+            char *sPeople = GetID3EncText(myFrame, ID3FN_TEXT, nIndex);
             std::cout << sPeople;
             delete [] sPeople;
             if (nIndex < nItems)
@@ -220,7 +221,7 @@
         {
           char
             *sMimeType = ID3_GetString(myFrame, ID3FN_MIMETYPE),
-            *sDesc     = ID3_GetString(myFrame, ID3FN_DESCRIPTION),
+            *sDesc     = GetID3EncText(myFrame, ID3FN_DESCRIPTION),
             *sFormat   = ID3_GetString(myFrame, ID3FN_IMAGEFORMAT);
           size_t
             nPicType   = myFrame->Field(ID3FN_PICTURETYPE).Get(),
@@ -237,8 +238,8 @@
         {
           char 
           *sMimeType = ID3_GetString(myFrame, ID3FN_MIMETYPE), 
-          *sDesc = ID3_GetString(myFrame, ID3FN_DESCRIPTION), 
-          *sFileName = ID3_GetString(myFrame, ID3FN_FILENAME);
+          *sDesc = GetID3EncText(myFrame, ID3FN_DESCRIPTION),
+          *sFileName = GetID3EncText(myFrame, ID3FN_FILENAME);
           size_t nDataSize = myFrame->GetField(ID3FN_DATA)->Size();
           std::cout << "(" << sDesc << ")[" 
                << sFileName << "]: " << sMimeType << ", " << nDataSize
Index: id3v2-0.1.12/Makefile
===================================================================
--- id3v2-0.1.12.orig/Makefile	2011-05-01 16:39:39.521454647 +0200
+++ id3v2-0.1.12/Makefile	2011-05-01 16:40:04.491454657 +0200
@@ -6,7 +6,7 @@
 CXXFLAGS+=	-Wall -I${PREFIX}/include/ -DVERSION="\"${VERSION}\"" #-DSORT_RUNTIME
 LDFLAGS+=	-L${PREFIX}/lib/ 
 
-id3v2:	convert.o list.o id3v2.o genre.o
+id3v2:	convert.o list.o id3v2.o genre.o charset.o
 	${CXX} ${LDFLAGS} -pedantic -Wall -g -o $@ $^ -lz -lid3
 
 create_map: create_map.o

Reply via email to