richardstartin opened a new issue #8094: URL: https://github.com/apache/pinot/issues/8094
Currently, reading values from `StringDictionary` is very expensive in terms of allocations. First, an oversized `byte[]` is allocated in `StringDictionary`:

```java
@Override
public String getStringValue(int dictId) {
  return getUnpaddedString(dictId, getBuffer());
}

protected byte[] getBuffer() {
  return new byte[_numBytesPerValue];
}
```

Then a `String` is allocated once the size is known in `FixedByteValueReaderWriter`:

```java
@Override
public String getUnpaddedString(int index, int numBytesPerValue, byte paddingByte, byte[] buffer) {
  // Based on the ZeroInWord algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
  assert buffer.length >= numBytesPerValue;
  long startOffset = (long) index * numBytesPerValue;
  long pattern = (paddingByte & 0xFFL) * 0x101010101010101L;
  ByteBuffer wrapper = ByteBuffer.wrap(buffer);
  if (_dataBuffer.order() == ByteOrder.LITTLE_ENDIAN) {
    wrapper.order(ByteOrder.LITTLE_ENDIAN);
  }
  int position = 0;
  for (int i = 0; i < ((numBytesPerValue >>> 3) << 3); i += 8) {
    long word = _dataBuffer.getLong(startOffset + i);
    wrapper.putLong(i, word);
    long zeroed = word ^ pattern;
    long tmp = (zeroed & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
    tmp = ~(tmp | zeroed | 0x7F7F7F7F7F7F7F7FL);
    if (tmp == 0) {
      position += 8;
    } else {
      position += _dataBuffer.order() == ByteOrder.LITTLE_ENDIAN
          ? Long.numberOfTrailingZeros(tmp) >>> 3
          : Long.numberOfLeadingZeros(tmp) >>> 3;
      return new String(buffer, 0, position, UTF_8);
    }
  }
  return getUnpaddedStringTail(startOffset, position, numBytesPerValue, paddingByte, buffer);
}

private String getUnpaddedStringTail(long startOffset, int position, int numBytesPerValue, byte paddingByte,
    byte[] buffer) {
  for (; position < numBytesPerValue; position++) {
    byte b = _dataBuffer.getByte(startOffset + position);
    if (b == paddingByte) {
      break;
    }
    buffer[position] = b;
  }
  return new String(buffer, 0, position, UTF_8);
}
```

Having a `byte[]` is often preferable to a `String` anyway, so this could be streamlined by first calculating the length of the `String` and then allocating a correctly sized `byte[]`, since the buffer isn't reused.
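The streamlining suggested above (find the unpadded length first, then allocate an exactly sized `byte[]`) might look roughly like the sketch below. This is an illustration, not Pinot code: the class `UnpaddedBytesSketch` and the methods `unpaddedLength` and `getUnpaddedBytes` are hypothetical names, and a plain `java.nio.ByteBuffer` with `int` offsets stands in for Pinot's data buffer. The scan reuses the same ZeroInWord trick but does not copy anything into a scratch buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: not part of Pinot. A java.nio.ByteBuffer stands in for the data buffer.
final class UnpaddedBytesSketch {
  private UnpaddedBytesSketch() {
  }

  /** Returns the number of bytes before the first padding byte, scanning 8 bytes at a time. */
  static int unpaddedLength(ByteBuffer data, int startOffset, int numBytesPerValue, byte paddingByte) {
    // Broadcast the padding byte into every byte of a 64-bit word.
    long pattern = (paddingByte & 0xFFL) * 0x0101010101010101L;
    boolean littleEndian = data.order() == ByteOrder.LITTLE_ENDIAN;
    int wordBytes = (numBytesPerValue >>> 3) << 3;
    int position = 0;
    for (; position < wordBytes; position += 8) {
      // After the XOR, a byte is zero exactly where the word held the padding byte.
      long zeroed = data.getLong(startOffset + position) ^ pattern;
      // ZeroInWord: the high bit of each byte of tmp is set iff that byte of zeroed is zero.
      long tmp = (zeroed & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
      tmp = ~(tmp | zeroed | 0x7F7F7F7F7F7F7F7FL);
      if (tmp != 0) {
        // The first padding byte in memory order is the lowest set byte for little-endian
        // reads and the highest set byte for big-endian reads.
        int bits = littleEndian ? Long.numberOfTrailingZeros(tmp) : Long.numberOfLeadingZeros(tmp);
        return position + (bits >>> 3);
      }
    }
    // Fewer than 8 bytes remain: check them one at a time.
    while (position < numBytesPerValue && data.get(startOffset + position) != paddingByte) {
      position++;
    }
    return position;
  }

  /** Allocates a byte[] of exactly the unpadded length and copies the value into it. */
  static byte[] getUnpaddedBytes(ByteBuffer data, int index, int numBytesPerValue, byte paddingByte) {
    int startOffset = index * numBytesPerValue;
    int length = unpaddedLength(data, startOffset, numBytesPerValue, paddingByte);
    byte[] value = new byte[length];
    for (int i = 0; i < length; i++) {
      value[i] = data.get(startOffset + i);
    }
    return value;
  }
}
```

With this shape there is a single allocation per read, sized to the value, and a caller that really needs a `String` can still build one from the returned bytes. The trade-off is that the value's bytes are read twice: once to find the length and once to copy.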