[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1543: LUCENE-9378: Disable compression on binary values whose length is less than 32.

GitBox Wed, 17 Jun 2020 08:29:10 -0700


mikemccand commented on a change in pull request #1543:
URL: https://github.com/apache/lucene-solr/pull/1543#discussion_r441635457




##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -762,6 +764,97 @@ public BytesRef binaryValue() throws IOException {
   // Decompresses blocks of binary values to retrieve content
   class BinaryDecoder {
     
+    private final LongValues addresses;
+    private final IndexInput compressedData;
+    // Cache of last uncompressed block 
+    private long lastBlockId = -1;
+    private final ByteBuffer deltas;
+    private int numBytes;
+    private int uncompressedBlockLength;  
+    private int avgLength;
+    private final byte[] uncompressedBlock;
+    private final BytesRef uncompressedBytesRef;
+    private final int docsPerChunk;
+    private final int docsPerChunkShift;
+    
+    public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize, int docsPerChunkShift) {
+      super();
+      this.addresses = addresses;
+      this.compressedData = compressedData;
+      // pre-allocate a byte array large enough for the biggest uncompressed block needed.
+      this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+      uncompressedBytesRef = new BytesRef(uncompressedBlock);
+      this.docsPerChunk = 1 << docsPerChunkShift;
+      this.docsPerChunkShift = docsPerChunkShift;
+      deltas = ByteBuffer.allocate((docsPerChunk + 1) * Integer.BYTES);
+      deltas.order(ByteOrder.LITTLE_ENDIAN);
+    }
+
+    private void decodeBlock(int blockId) throws IOException {
+      long blockStartOffset = addresses.get(blockId);
+      compressedData.seek(blockStartOffset);
+
+      final long token = compressedData.readVLong();
+      uncompressedBlockLength = (int) (token >>> 4);
+      avgLength = uncompressedBlockLength >>> docsPerChunkShift;
+      numBytes = (int) (token & 0x0f);
+      switch (numBytes) {
+        case Integer.BYTES:
+          deltas.putInt(0, (int) 0);
+          compressedData.readBytes(deltas.array(), Integer.BYTES, docsPerChunk * Integer.BYTES);
+          break;
+        case Byte.BYTES:
+          compressedData.readBytes(deltas.array(), Byte.BYTES, docsPerChunk * Byte.BYTES);
+          break;
+        case 0:
+          break;
+        default:
+          throw new CorruptIndexException("Invalid number of bytes: " + numBytes, compressedData);
+      }
+
+      if (uncompressedBlockLength == 0) {
+        uncompressedBytesRef.offset = 0;
+        uncompressedBytesRef.length = 0;
+      } else {
+        assert uncompressedBlockLength <= uncompressedBlock.length;
+        LZ4.decompress(compressedData, uncompressedBlockLength, uncompressedBlock);
+      }
+    }
+
+    BytesRef decode(int docNumber) throws IOException {
+      int blockId = docNumber >> docsPerChunkShift; 
+      int docInBlockId = docNumber % docsPerChunk;
+      assert docInBlockId < docsPerChunk;
+      
+      
+      // already read and uncompressed?
+      if (blockId != lastBlockId) {
+        decodeBlock(blockId);
+        lastBlockId = blockId;
+      }
+
+      int startDelta = 0, endDelta = 0;
+      switch (numBytes) {
+        case Integer.BYTES:
+          startDelta = deltas.getInt(docInBlockId * Integer.BYTES);
+          endDelta = deltas.getInt((docInBlockId + 1) * Integer.BYTES);

Review comment:
       Aha!  Sneaky :)
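
       For readers following along: the trick being admired here is presumably the
       extra leading slot in `deltas`. The buffer is allocated with
       `docsPerChunk + 1` ints, slot 0 is set to 0, and the stored values are read
       in starting at byte offset `Integer.BYTES`, so the start for doc `i` is
       slot `i` and the end is slot `i + 1`, with no special case for the first
       doc. A minimal standalone sketch of that layout follows. The class and
       method names are hypothetical, and it assumes for illustration that the
       stored values are cumulative end offsets; the quoted hunk ends before
       showing how `startDelta` and `endDelta` are actually used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration, not the Lucene implementation.
final class DeltaOffsetsSketch {

    // Builds a (count + 1)-slot table: slot 0 is a fixed 0, slots 1..count
    // hold the stored values, mirroring the readBytes(..., Integer.BYTES, ...) call.
    static ByteBuffer buildOffsets(int[] storedValues) {
        ByteBuffer deltas = ByteBuffer.allocate((storedValues.length + 1) * Integer.BYTES);
        deltas.order(ByteOrder.LITTLE_ENDIAN);
        deltas.putInt(0, 0); // leading sentinel for the first doc
        for (int i = 0; i < storedValues.length; i++) {
            deltas.putInt((i + 1) * Integer.BYTES, storedValues[i]);
        }
        return deltas;
    }

    // Start of doc i is slot i; thanks to the sentinel, doc 0 starts at 0.
    static int start(ByteBuffer deltas, int docInBlock) {
        return deltas.getInt(docInBlock * Integer.BYTES);
    }

    // End of doc i is slot i + 1.
    static int end(ByteBuffer deltas, int docInBlock) {
        return deltas.getInt((docInBlock + 1) * Integer.BYTES);
    }

    public static void main(String[] args) {
        // Assumed example data: three values ending at offsets 3, 3 and 10.
        ByteBuffer deltas = buildOffsets(new int[] {3, 3, 10});
        for (int doc = 0; doc < 3; doc++) {
            System.out.println("doc " + doc + ": [" + start(deltas, doc) + ", " + end(deltas, doc) + ")");
        }
    }
}
```

       The cost is one extra int per block; the benefit is that both lookups use
       the same indexing expression for every doc in the block.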


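       Related context from the same hunk: `decodeBlock` reads one vLong token
       per block and splits it into the uncompressed block length (upper bits,
       `token >>> 4`) and the number of bytes per stored delta (low four bits,
       `token & 0x0f`), then derives `avgLength` by shifting the length right by
       `docsPerChunkShift`. The sketch below shows only that bit arithmetic. The
       `pack` method is a hypothetical inverse added for illustration, since the
       writer side is not part of the quoted diff.

```java
// Hypothetical illustration of the token layout read in decodeBlock.
final class BlockTokenSketch {

    // Assumed inverse of the decoding: length in the upper bits, numBytes in the low 4.
    static long pack(int uncompressedBlockLength, int numBytes) {
        return ((long) uncompressedBlockLength << 4) | (numBytes & 0x0f);
    }

    // Matches: uncompressedBlockLength = (int) (token >>> 4)
    static int blockLength(long token) {
        return (int) (token >>> 4);
    }

    // Matches: numBytes = (int) (token & 0x0f)
    static int numBytes(long token) {
        return (int) (token & 0x0f);
    }

    // Matches: avgLength = uncompressedBlockLength >>> docsPerChunkShift
    static int avgLength(long token, int docsPerChunkShift) {
        return blockLength(token) >>> docsPerChunkShift;
    }

    public static void main(String[] args) {
        // Assumed example: a 1000-byte block with int-sized deltas, 32 docs per chunk.
        long token = pack(1000, Integer.BYTES);
        System.out.println("length=" + blockLength(token)
            + " numBytes=" + numBytes(token)
            + " avgLength=" + avgLength(token, 5));
    }
}
```

       In the quoted code, a `numBytes` value other than `Integer.BYTES`,
       `Byte.BYTES`, or 0 is rejected with a `CorruptIndexException`.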


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



