Re: [PR] Random access term dictionary [lucene]

via GitHub Mon, 13 Nov 2023 04:20:48 -0800


mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391032959



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {

Review Comment:
   Hmm maybe don't name it `Codec` since it makes it seem like it could be a 
normal Lucene `Codec`?  But this is much lower level ... maybe 
`TermStateEncoder`?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {

Review Comment:
   Do we really need the separable interface and impl yet?  Are there more than 
one impl we are exploring?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/bitpacking/FixedSizeByteArrayBitPacker.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking;
+
+/**
+ * A {@link BitPacker} implementation that requires user to know the size of 
the resulting byte
+ * array upfront, in order to avoid allocation and copying for dynamically 
growing the array.
+ */
+public final class FixedSizeByteArrayBitPacker extends BitPackerImplBase {
+  private final byte[] bytes;
+  private int numBytesUsed = 0;

Review Comment:
   No need for `= 0` -- it's already java's default.



##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TestTermStateCodecImpl.java:
##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPerBytePacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpackerImpl;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.FixedSizeByteArrayBitPacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.BytesRef;
+
+public class TestTermStateCodecImpl extends LuceneTestCase {
+
+  public void testEncodeDecode() {
+    TermStateCodecImpl codec =
+        new TermStateCodecImpl(
+            new TermStateCodecComponent[] {
+              TermStateCodecComponent.DocFreq.INSTANCE, 
TermStateCodecComponent.DocStartFP.INSTANCE,
+            });
+
+    ArrayList<IntBlockTermState> termStates = new ArrayList<>();
+    long maxDocFreqSeen = -1;
+    long docStartFPBase = random().nextLong(Long.MAX_VALUE >> 1);
+    long maxDocStartFPDeltaSeen = -1;
+    for (int i = 0; i < random().nextInt(2, 256); i++) {
+      var termState = new IntBlockTermState();
+      termState.docFreq = random().nextInt(1, 1 << random().nextInt(1, 31));
+      if (i == 0) {
+        termState.docStartFP = docStartFPBase;
+      } else {
+        termState.docStartFP = termStates.get(i - 1).docStartFP + 
random().nextLong(1024);
+        maxDocStartFPDeltaSeen =
+            Math.max(maxDocStartFPDeltaSeen, termState.docStartFP - 
docStartFPBase);
+      }
+      maxDocFreqSeen = Math.max(maxDocFreqSeen, termState.docFreq);
+      termStates.add(termState);
+    }
+
+    IntBlockTermState[] termStatesArray = 
termStates.toArray(IntBlockTermState[]::new);
+
+    BitPerBytePacker bitPerBytePacker = new BitPerBytePacker();
+    byte[] metadata = codec.encodeBlock(termStatesArray, bitPerBytePacker);
+
+    // For the metadata, we expect
+    // 0: DocFreq.bitWidth,
+    // 1: DocStartFP.bitWidth,
+    // [2-10]: DocStartFP.referenceValue;
+    int expectedDocFreqBitWidth = 64 - 
Long.numberOfLeadingZeros(maxDocFreqSeen);
+    int expectedDocStartFPBitWidth = 64 - 
Long.numberOfLeadingZeros(maxDocStartFPDeltaSeen);
+    assertEquals(10, metadata.length);
+    assertEquals(expectedDocFreqBitWidth, metadata[0]);
+    assertEquals(expectedDocStartFPBitWidth, metadata[1]);
+    ByteArrayDataInput byteArrayDataInput = new ByteArrayDataInput(metadata, 
2, 8);
+    assertEquals(docStartFPBase, byteArrayDataInput.readLong());
+
+    // Assert with real bit-packer we get the same bytes
+    FixedSizeByteArrayBitPacker fixedSizeByteArrayBitPacker =
+        new 
FixedSizeByteArrayBitPacker(bitPerBytePacker.getCompactBytes().length);
+    codec.encodeBlock(termStatesArray, fixedSizeByteArrayBitPacker);
+    assertArrayEquals(bitPerBytePacker.getCompactBytes(), 
fixedSizeByteArrayBitPacker.getBytes());
+
+    // Assert that each term state is the same after the encode-decode 
roundtrip.
+    BytesRef metadataBytes = new BytesRef(metadata);
+    BytesRef dataBytes = new BytesRef(bitPerBytePacker.getBytes());
+    assertBlockRoundTrip(termStatesArray, codec, metadataBytes, dataBytes, 
bitPerBytePacker);
+
+    // With real compact bits instead of bit-per-byte
+    dataBytes = new BytesRef(bitPerBytePacker.getCompactBytes());
+    assertBlockRoundTrip(
+        termStatesArray, codec, metadataBytes, dataBytes, 
BitUnpackerImpl.INSTANCE);
+
+    // Also test decoding that doesn't begin at the start of the block.
+    int pos = random().nextInt(termStatesArray.length);
+    int startBitIndex = pos > 0 ? random().nextInt(pos) : 0;
+    int recordSize = expectedDocFreqBitWidth + expectedDocStartFPBitWidth;
+    // With bit-per-byte bytes
+    dataBytes =
+        new BytesRef(bitPerBytePacker.getBytes(), pos * recordSize - 
startBitIndex, recordSize);
+    assertDecodeAt(
+        codec, metadataBytes, dataBytes, bitPerBytePacker, startBitIndex, 
termStatesArray[pos]);
+
+    // With compact bytes
+    int startByteIndex = pos * recordSize / 8;
+    int endByteIndex = (pos + 1) * recordSize / 8;
+    int length = endByteIndex - startByteIndex + ((pos + 1) * recordSize % 8 
== 0 ? 0 : 1);
+    dataBytes = new BytesRef(bitPerBytePacker.getCompactBytes(), 
startByteIndex, length);
+    assertDecodeAt(
+        codec,
+        metadataBytes,
+        dataBytes,
+        BitUnpackerImpl.INSTANCE,
+        (pos * recordSize) % 8,
+        termStatesArray[pos]);
+  }
+
+  private static void assertDecodeAt(
+      TermStateCodecImpl codec,
+      BytesRef metadataBytes,
+      BytesRef dataBytes,
+      BitUnpacker bitUnpacker,
+      int startBitIndex,
+      IntBlockTermState termState) {
+    IntBlockTermState decoded =
+        codec.decodeAt(metadataBytes, dataBytes, bitUnpacker, startBitIndex);
+    assertEquals(termState.docFreq, decoded.docFreq);
+    assertEquals(termState.docStartFP, decoded.docStartFP);
+  }
+
+  private static void assertBlockRoundTrip(
+      IntBlockTermState[] termStatesArray,
+      TermStateCodecImpl codec,
+      BytesRef metadataBytes,
+      BytesRef dataBytes,
+      BitUnpacker bitUnpacker) {
+    for (int i = 0; i < termStatesArray.length; i++) {
+      IntBlockTermState decoded = codec.decodeWithinBlock(metadataBytes, 
dataBytes, bitUnpacker, i);
+      assertEquals(termStatesArray[i].docFreq, decoded.docFreq);
+      assertEquals(termStatesArray[i].docStartFP, decoded.docStartFP);
+    }
+  }
+
+  public void testGetCodec() {
+    for (IndexOptions indexOptions : IndexOptions.values()) {
+      if (indexOptions == IndexOptions.NONE) {
+        continue;
+      }
+      for (int i = 0; i < 8; i++) {
+        if ((i & 0b011) == 0b011) {
+          continue;
+        }
+        if ((i & 0b100) == 0b100
+            && indexOptions.ordinal() < 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          continue;
+        }
+        TermType termType = TermType.fromId(i);
+        var expected = getExpectedCodec(termType, indexOptions);
+        var got = TermStateCodecImpl.getCodec(termType, indexOptions);
+        assertEquals(expected, got);
+      }
+    }
+  }
+
+  // Enumerate the expected Codec we get for (TermType, IndexOptions) pairs.
+  static TermStateCodecImpl getExpectedCodec(TermType termType, IndexOptions 
indexOptions) {
+    ArrayList<TermStateCodecComponent> components = new ArrayList<>();
+    // Wish I can code this better in java...
+    switch (termType.getId()) {
+        // Not singleton doc; No skip data; No last position block offset
+      case 0b000 -> {
+        assert !termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && !termType.hasSingletonDoc();
+        components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+          components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+        }
+        if (indexOptions.ordinal() >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+          components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        }
+        if (indexOptions.ordinal()
+            >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Singleton doc; No skip data; No last position block offset
+      case 0b001 -> {
+        assert !termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && termType.hasSingletonDoc();
+        components.add(TermStateCodecComponent.SingletonDocId.INSTANCE);
+        // If field needs frequency, we need totalTermsFreq.
+        // Since there is only 1 doc, totalTermsFreq == docFreq.
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+          components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+        }
+        if (indexOptions.ordinal() >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        }
+        if (indexOptions.ordinal()
+            >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+
+        // Not Singleton doc; Has skip data; No last position block offset
+      case 0b010 -> {
+        assert !termType.hasLastPositionBlockOffset()
+            && termType.hasSkipData()
+            && !termType.hasSingletonDoc();
+        components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+        components.add(TermStateCodecComponent.SkipOffset.INSTANCE);
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+          components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+        }
+        if (indexOptions.ordinal() >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+          components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        }
+        if (indexOptions.ordinal()
+            >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Singleton doc but has skip data; Invalid state.
+      case 0b011, 0b111 -> {
+        assert termType.hasSkipData() && termType.hasSingletonDoc();
+        throw new IllegalStateException(
+            "Unreachable. A term has skip data but also only has one doc!? 
Must be a bug");
+      }
+        // Not singleton doc; No skip data; Has last position block offset;
+      case 0b100 -> {
+        assert termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && !termType.hasSingletonDoc();
+        assert indexOptions.ordinal() >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal();
+        components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+        components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+        components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+        components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        
components.add(TermStateCodecComponent.LastPositionBlockOffset.INSTANCE);
+        if (indexOptions.ordinal()
+            >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Singleton doc; No skip data; Has last position block offset;
+      case 0b101 -> {
+        assert termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && termType.hasSingletonDoc();
+        assert indexOptions.ordinal() >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal();
+        components.add(TermStateCodecComponent.SingletonDocId.INSTANCE);
+        components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+        components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        
components.add(TermStateCodecComponent.LastPositionBlockOffset.INSTANCE);
+        if (indexOptions.ordinal()
+            >= 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Not singleton doc; Has skip data; Has last position block offset;
+      case 0b110 -> {

Review Comment:
   Oooh I like the binary-valued case switches!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Random access term dictionary [lucene]

Reply via email to