Hi Mathias,
> > > I assume that the char[] returned from
> > > IndexableBinaryStringTools.encode is encoded in UTF-8 again
> > > and then stored. At some point the information is lost and
> > > cannot be recovered.
> >
> > Can you give an example? This should not happen.
>
> My character array returned by IndexableBinaryStringTools.encode looks
> like following:
>
> char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};
[...]
> BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.
>
> Why has the string representation changed? From the changed string I
> cannot decode the correct ID.
Looks to me like the returned value is in a Solr-internal form of XML character
escaping: \u0000 is represented as "#0;" and \u0008 is represented as "#8;".
(The escaping code is in solr/src/java/org/apache/common/util/XML.java.)
You can get the value back in its original binary form by unescaping the
/#[0-9]+;/ format. Here is a test illustrating this fix that I added to
SolrExampleTests, then ran from SolrExampleEmbeddedTest:
==============
@Test
public void testIndexableBinary() throws Exception {
  // Empty the database...
  server.deleteByQuery( "*:*" ); // delete everything!
  server.commit();
  assertNumFound( "*:*", 0 ); // make sure the index is empty

  byte[] binary = new byte[]
    { (byte)0, (byte)0, (byte)0x84, (byte)0xF0, (byte)0x6A, (byte)0,
      (byte)4, (byte)0, (byte)0, (byte)0, (byte)2, (byte)0 };
  int encodedLen = IndexableBinaryStringTools.getEncodedLength
    (binary, 0, binary.length);
  char encoded[] = new char[encodedLen];
  IndexableBinaryStringTools.encode
    (binary, 0, binary.length, encoded, 0, encoded.length);
  final String encodedString = new String(encoded);
  log.info("Encoded: " + stringToIntSequence(encodedString));

  // Expected encoded: { 0, 8508, 3392, 64, 0, 8, 0, 0 }
  String expectedEncoded = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
  assertEquals(stringToIntSequence(expectedEncoded),
               stringToIntSequence(encodedString));

  // Index the encoded string as the document ID
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", encodedString);
  server.add(doc);
  server.commit();

  // Retrieve it and undo the "#N;" escaping before comparing
  SolrQuery query = new SolrQuery();
  query.setQuery("*:*");
  QueryResponse rsp = server.query(query);
  SolrDocument retrievedDoc = rsp.getResults().get(0);
  String retrievedEncoded = (String)retrievedDoc.getFieldValue("id");
  String unescapedRetrievedEncoded =
    unescapeSolrXMLEscaping(retrievedEncoded);
  assertEquals(stringToIntSequence(encodedString),
               stringToIntSequence(unescapedRetrievedEncoded));
}

// Renders each char as "code (char)", comma-separated, for readable diffs
String stringToIntSequence(String str) {
  StringBuilder builder = new StringBuilder();
  for (int chnum = 0 ; chnum < str.length() ; ++chnum) {
    if (chnum > 0) {
      builder.append(", ");
    }
    builder.append((int)str.charAt(chnum))
           .append(" (").append(str.charAt(chnum)).append(")");
  }
  return builder.toString();
}

// Requires imports of java.util.regex.Matcher and java.util.regex.Pattern
String unescapeSolrXMLEscaping(String escaped) {
  StringBuffer unescaped = new StringBuffer();
  Matcher matcher = Pattern.compile("#(\\d+);").matcher(escaped);
  while (matcher.find()) {
    String replacement =
      String.valueOf((char)Integer.parseInt(matcher.group(1)));
    // Quote the replacement so that '$' and '\' are taken literally
    matcher.appendReplacement
      (unescaped, Matcher.quoteReplacement(replacement));
  }
  matcher.appendTail(unescaped);
  return unescaped.toString();
}
==============
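
If you want to try the unescaping without a Solr instance, here is a
standalone sketch of the round trip. Note that escape() is only my
hypothetical stand-in for the behavior described above (I am assuming that
control characters below 0x20, other than tab, newline and carriage return,
are written as "#N;"); it is not the actual code from XML.java. The class
and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SolrEscapeDemo {
  private static final Pattern ESCAPED = Pattern.compile("#(\\d+);");

  // Hypothetical stand-in for the escaping described above (an assumption,
  // NOT Solr's code): control chars become "#N;", N being the decimal code.
  public static String escape(String raw) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < raw.length(); i++) {
      char ch = raw.charAt(i);
      if (ch < 0x20 && ch != '\t' && ch != '\n' && ch != '\r') {
        out.append('#').append((int) ch).append(';');
      } else {
        out.append(ch);
      }
    }
    return out.toString();
  }

  // Turns every "#N;" sequence back into the char with decimal code N.
  public static String unescape(String escaped) {
    StringBuffer out = new StringBuffer();
    Matcher m = ESCAPED.matcher(escaped);
    while (m.find()) {
      String ch = String.valueOf((char) Integer.parseInt(m.group(1)));
      // quoteReplacement keeps '$' and '\' from being treated specially
      m.appendReplacement(out, Matcher.quoteReplacement(ch));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    String original = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
    String escaped = escape(original);
    System.out.println("Round trip OK: " + unescape(escaped).equals(original));
  }
}
```

One caveat: the unescaping is ambiguous if the original value itself
contains literal text of the form "#8;", because that text would also be
converted to a control character.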
Steve