[GitHub] [lucene] original-brownbear commented on a diff in pull request #12455: Clean up writing String to ByteBuffersDataOutput

via GitHub Wed, 26 Jul 2023 04:19:45 -0700


original-brownbear commented on code in PR #12455:
URL: https://github.com/apache/lucene/pull/12455#discussion_r1274811181



##########
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
##########
@@ -549,66 +540,22 @@ private static int computeBlockSizeBitsFor(long bytes) {
     return blockBits;
   }
 
-  // TODO: move this block-based conversion to UnicodeUtil.
-
-  private static final long HALF_SHIFT = 10;
-  private static final int SURROGATE_OFFSET =
-      Character.MIN_SUPPLEMENTARY_CODE_POINT
-          - (UnicodeUtil.UNI_SUR_HIGH_START << HALF_SHIFT)
-          - UnicodeUtil.UNI_SUR_LOW_START;
-
-  /** A consumer-based UTF16-UTF8 encoder (writes the input string in smaller 
buffers.). */
-  private static int UTF16toUTF8(
-      final CharSequence s,
-      final int offset,
-      final int length,
-      byte[] buf,
-      IntConsumer bufferFlusher) {
-    int utf8Len = 0;
-    int j = 0;
-    for (int i = offset, end = offset + length; i < end; i++) {
-      final int chr = (int) s.charAt(i);
-
-      if (j + 4 >= buf.length) {
-        bufferFlusher.accept(j);
-        utf8Len += j;
-        j = 0;
-      }
-
-      if (chr < 0x80) buf[j++] = (byte) chr;
-      else if (chr < 0x800) {
-        buf[j++] = (byte) (0xC0 | (chr >> 6));
-        buf[j++] = (byte) (0x80 | (chr & 0x3F));
-      } else if (chr < 0xD800 || chr > 0xDFFF) {
-        buf[j++] = (byte) (0xE0 | (chr >> 12));
-        buf[j++] = (byte) (0x80 | ((chr >> 6) & 0x3F));
-        buf[j++] = (byte) (0x80 | (chr & 0x3F));
-      } else {
-        // A surrogate pair. Confirm valid high surrogate.
-        if (chr < 0xDC00 && (i < end - 1)) {
-          int utf32 = (int) s.charAt(i + 1);
-          // Confirm valid low surrogate and write pair.
-          if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
-            utf32 = (chr << 10) + utf32 + SURROGATE_OFFSET;
-            i++;
-            buf[j++] = (byte) (0xF0 | (utf32 >> 18));
-            buf[j++] = (byte) (0x80 | ((utf32 >> 12) & 0x3F));
-            buf[j++] = (byte) (0x80 | ((utf32 >> 6) & 0x3F));
-            buf[j++] = (byte) (0x80 | (utf32 & 0x3F));
-            continue;
-          }
-        }
-        // Replace unpaired surrogate or out-of-order low surrogate
-        // with substitution character.
-        buf[j++] = (byte) 0xEF;
-        buf[j++] = (byte) 0xBF;
-        buf[j++] = (byte) 0xBD;
+  /** Writes a long string in chunks */
+  private void writeLongString(final String s) throws IOException {
+    final int byteLen = UnicodeUtil.calcUTF16toUTF8Length(s, 0, s.length());
+    writeVInt(byteLen);
+    final byte[] buf =
+        new byte[Math.min(byteLen, UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR * 
MAX_CHARS_PER_WINDOW)];
+    for (int i = 0, end = s.length(); i < end; ) {
+      int step = Math.min(end - i, MAX_CHARS_PER_WINDOW);
+      // don't split a surrogate pair, since the pair together takes 2 bytes 
per char but have space
+      // for 3 bytes per char it's safe to encode one more char here

Review Comment:
   Sure, I guess there's not much value in optimising for one char here :) I 
went with the -1 on the step size now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] original-brownbear commented on a diff in pull request #12455: Clean up writing String to ByteBuffersDataOutput

Reply via email to