[GitHub] [pinot] richardstartin opened a new issue #8094: Make it possible to read strings from `Dictionary` as `byte[]`

GitBox Mon, 31 Jan 2022 04:50:24 -0800


richardstartin opened a new issue #8094:
URL: https://github.com/apache/pinot/issues/8094



   Currently reading values from `StringDictionary` is very expensive in terms 
of allocations.
   
   First, an oversized `byte[]` is allocated in `StringDictionary`:
   
   ```java
     @Override
     public String getStringValue(int dictId) {
       return getUnpaddedString(dictId, getBuffer());
     }
   
     protected byte[] getBuffer() {
       return new byte[_numBytesPerValue];
     }
   ```
   
   Then a `String` is allocated once the size is known in 
`FixedByteValueReaderWriter`:
   
   ```java
     @Override
     public String getUnpaddedString(int index, int numBytesPerValue, byte paddingByte, byte[] buffer) {
       // Based on the ZeroInWord algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
       assert buffer.length >= numBytesPerValue;
       long startOffset = (long) index * numBytesPerValue;
       long pattern = (paddingByte & 0xFFL) * 0x101010101010101L;
       ByteBuffer wrapper = ByteBuffer.wrap(buffer);
       if (_dataBuffer.order() == ByteOrder.LITTLE_ENDIAN) {
         wrapper.order(ByteOrder.LITTLE_ENDIAN);
       }
       int position = 0;
       for (int i = 0; i < ((numBytesPerValue >>> 3) << 3); i += 8) {
         long word = _dataBuffer.getLong(startOffset + i);
         wrapper.putLong(i, word);
         long zeroed = word ^ pattern;
         long tmp = (zeroed & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
         tmp = ~(tmp | zeroed | 0x7F7F7F7F7F7F7F7FL);
         if (tmp == 0) {
           position += 8;
         } else {
           position += _dataBuffer.order() == ByteOrder.LITTLE_ENDIAN
               ? Long.numberOfTrailingZeros(tmp) >>> 3
               : Long.numberOfLeadingZeros(tmp) >>> 3;
           return new String(buffer, 0, position, UTF_8);
         }
       }
       return getUnpaddedStringTail(startOffset, position, numBytesPerValue, paddingByte, buffer);
     }
   
     private String getUnpaddedStringTail(long startOffset, int position, int numBytesPerValue, byte paddingByte,
         byte[] buffer) {
       for (; position < numBytesPerValue; position++) {
         byte b = _dataBuffer.getByte(startOffset + position);
         if (b == paddingByte) {
           break;
         }
         buffer[position] = b;
       }
       return new String(buffer, 0, position, UTF_8);
     }
   ```
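
For readers unfamiliar with the ZeroInWord trick used above, here is a minimal standalone sketch of the per-word check. The class and method names (`ZeroInWordSketch`, `firstPaddingByte`) are hypothetical and not part of Pinot, and a plain `java.nio.ByteBuffer` stands in for the data buffer. It only illustrates how the first padding byte in one 8-byte word is located for either byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ZeroInWordSketch {
  private static final long LOW_7_BITS = 0x7F7F7F7F7F7F7F7FL;

  // Hypothetical helper: index (0-7) of the first byte in the word that equals
  // paddingByte, in the byte order the word was read with, or 8 if there is none.
  static int firstPaddingByte(long word, byte paddingByte, ByteOrder order) {
    // Broadcast the padding byte to all 8 lanes; XOR turns matching bytes into zero.
    long pattern = (paddingByte & 0xFFL) * 0x0101010101010101L;
    long zeroed = word ^ pattern;
    // After these two steps, each lane is 0x80 if the byte was zero, else 0x00.
    long tmp = (zeroed & LOW_7_BITS) + LOW_7_BITS;
    tmp = ~(tmp | zeroed | LOW_7_BITS);
    if (tmp == 0) {
      return 8;
    }
    return order == ByteOrder.LITTLE_ENDIAN
        ? Long.numberOfTrailingZeros(tmp) >>> 3
        : Long.numberOfLeadingZeros(tmp) >>> 3;
  }

  public static void main(String[] args) {
    byte[] padded = {'a', 'b', 'c', 0, 0, 0, 0, 0};
    for (ByteOrder order : new ByteOrder[] {ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN}) {
      long word = ByteBuffer.wrap(padded).order(order).getLong(0);
      System.out.println(order + ": " + firstPaddingByte(word, (byte) 0, order));
    }
  }
}
```

Both byte orders should report the same index for the same bytes, which is why the original code switches between trailing and leading zero counts.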
   
   A `byte[]` is often preferable to a `String` anyway, so this could be 
streamlined by computing the unpadded length of the value first and then 
allocating a correctly sized `byte[]`, since the buffer is never reused.
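
A rough sketch of that idea, under stated assumptions: `UnpaddedBytesSketch`, `unpaddedLength` and `getUnpaddedBytes` are hypothetical names, a `java.nio.ByteBuffer` stands in for the data buffer, and the length scan is a simple byte loop rather than the word-at-a-time scan. The point is only that the length is computed before anything is allocated, so the one `byte[]` created is exactly the right size.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class UnpaddedBytesSketch {

  // Hypothetical helper: number of bytes before the first padding byte.
  // Performs no allocation.
  static int unpaddedLength(ByteBuffer data, int index, int numBytesPerValue, byte paddingByte) {
    int start = index * numBytesPerValue;
    int length = 0;
    while (length < numBytesPerValue && data.get(start + length) != paddingByte) {
      length++;
    }
    return length;
  }

  // Hypothetical helper: returns the value as a byte[] of exactly the unpadded length.
  static byte[] getUnpaddedBytes(ByteBuffer data, int index, int numBytesPerValue, byte paddingByte) {
    int length = unpaddedLength(data, index, numBytesPerValue, paddingByte);
    byte[] value = new byte[length];
    // Copy through a duplicate so the shared buffer's position is left untouched.
    ByteBuffer view = data.duplicate();
    view.position(index * numBytesPerValue);
    view.get(value, 0, length);
    return value;
  }

  public static void main(String[] args) {
    int numBytesPerValue = 8;
    ByteBuffer data = ByteBuffer.allocate(2 * numBytesPerValue); // zero-filled, so padding is 0
    data.put("abc".getBytes(StandardCharsets.UTF_8));
    data.position(numBytesPerValue);
    data.put("pinot".getBytes(StandardCharsets.UTF_8));

    for (int dictId = 0; dictId < 2; dictId++) {
      byte[] value = getUnpaddedBytes(data, dictId, numBytesPerValue, (byte) 0);
      System.out.println(value.length + " bytes: " + new String(value, StandardCharsets.UTF_8));
    }
  }
}
```

Callers that still need a `String` can decode the returned array, while callers that only compare or hash bytes can skip the `String` allocation altogether.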


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


