Re: [PR] Use group-varint encoding for the tail of postings [lucene]

via GitHub Fri, 17 Nov 2023 13:47:40 -0800


jpountz commented on code in PR #12782:
URL: https://github.com/apache/lucene/pull/12782#discussion_r1391047570



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java:
##########
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Encode integers using group-varint. It uses VInt to encode tail values that 
are not enough for a
+ * group
+ */
+class GroupVIntWriter {
+  private int[] buffer = new int[4];
+  private byte[] bytes = new byte[16];
+  private int byteOffset = 0;
+  private int bufferOffset = 0;
+  private final IndexOutput output;
+
+  public GroupVIntWriter(IndexOutput output) {
+    this.output = output;
+  }
+
+  public void add(int v) throws IOException {
+    buffer = ArrayUtil.grow(buffer, bufferOffset + 1);
+    buffer[bufferOffset++] = v;
+  }
+
+  public void reset(int numValues) {

Review Comment:
   Let's remove the `numValues` parameter since it looks like we can't actually 
rely on it?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntReader.java:
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+
+/**
+ * Decode integers using group-varint. It will fully read the bytes for the 
block, to avoid repeated
+ * expensive bounds checking per readBytes.
+ */
+public class GroupVIntReader {
+  DataInput in;
+
+  public GroupVIntReader() {}
+
+  /** Called when decode a new block. */
+  public void reset(DataInput indexInput) throws IOException {
+    this.in = indexInput;
+  }
+
+  /** only readValues or nextInt can be called after reset */
+  public void readValues(long[] docs, int limit) throws IOException {

Review Comment:
   I'm not sure we need both a reset() and readValues(), maybe `readValues()` 
could take a `DataInput` directly?



##########
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java:
##########
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import org.apache.lucene.codecs.lucene99.GroupVIntReader;
+import org.apache.lucene.codecs.lucene99.GroupVIntWriter;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.MMapDirectory;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+@BenchmarkMode(Mode.Throughput)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Benchmark)
+@Warmup(iterations = 3, time = 3)
+@Measurement(iterations = 5, time = 5)
+@Fork(
+    value = 1,
+    jvmArgsPrepend = {"--add-modules=jdk.unsupported"})
+public class GroupVIntBenchmark {
+
+  final int maxSize = 256;
+  final long[] values = new long[maxSize];
+
+  IndexInput byteBufferGVIntIn;
+  IndexInput byteBufferVIntIn;
+  GroupVIntReader byteBufferGVIntReader = new GroupVIntReader();
+
+  ByteArrayDataInput byteArrayVIntIn;
+  ByteArrayDataInput byteArrayGVIntIn;
+  GroupVIntReader byteArrayGVIntReader = new GroupVIntReader();
+
+  // @Param({"16", "32", "64", "128", "248"})
+  @Param({"64"})
+  public int size;
+
+  @Param({"1", "2", "3", "4"})
+  public int numBytesPerInt;
+
+  private final int[] maxValues = new int[] {0, 1 << 4, 1 << 12, 1 << 18, 1 << 
25};
+
+  void initArrayInput(List<Integer> docs) throws Exception {
+    byte[] gVIntBytes = new byte[Integer.BYTES * maxSize * 2];
+    byte[] vIntBytes = new byte[Integer.BYTES * maxSize * 2];
+    ByteArrayDataOutput vIntOut = new ByteArrayDataOutput(vIntBytes);
+    GroupVIntWriter w = new GroupVIntWriter(new 
ByteArrayDataOutput(gVIntBytes));
+    for (int v : docs) {
+      vIntOut.writeVInt(v);
+      w.add(v);
+    }
+    w.flush();
+    byteArrayVIntIn = new ByteArrayDataInput(vIntBytes);
+    byteArrayGVIntIn = new ByteArrayDataInput(gVIntBytes);
+    byteArrayGVIntReader.reset(byteArrayGVIntIn);
+  }
+
+  void initByteBufferInput(List<Integer> docs) throws Exception {
+    Directory dir = 
MMapDirectory.open(Files.createTempDirectory("groupvintdata"));
+    IndexOutput vintOut = dir.createOutput("vint", IOContext.DEFAULT);
+    IndexOutput gvintOut = dir.createOutput("gvint", IOContext.DEFAULT);
+
+    GroupVIntWriter w = new GroupVIntWriter(gvintOut);
+    for (int v : docs) {
+      w.add(v);
+      vintOut.writeVInt(v);
+    }
+    w.flush();
+    vintOut.close();
+    gvintOut.close();
+    byteBufferGVIntIn = dir.openInput("gvint", IOContext.DEFAULT);
+    byteBufferVIntIn = dir.openInput("vint", IOContext.DEFAULT);
+    byteBufferGVIntReader.reset(byteBufferGVIntIn);
+  }
+
+  @Setup(Level.Trial)
+  public void init() throws Exception {
+    List<Integer> docs = new ArrayList<>();
+    int max = maxValues[numBytesPerInt];
+    int min = max >> 1;
+    for (int i = 0; i < maxSize; i++) {
+      int v = ThreadLocalRandom.current().nextInt(min, max);
+      docs.add(v);
+    }
+    initByteBufferInput(docs);
+    initArrayInput(docs);
+  }
+
+  @Benchmark
+  public void byteBufferReadVInt() throws IOException {
+    byteBufferVIntIn.seek(0);
+    for (int i = 0; i < size; i++) {
+      values[i] = byteBufferVIntIn.readVInt();
+    }

Review Comment:
   Do we need to pass the `values` array to a `Blackhole` object to make sure 
that the JVM doesn't optimize away some of the decoding logic?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Encode integers using group-varint. It uses VInt to encode tail values that 
are not enough for a
+ * group
+ */
+public class GroupVIntWriter {
+  private int[] buffer = new int[4];
+  private byte[] bytes = new byte[16];
+  private int byteOffset = 0;
+  private int bufferOffset = 0;
+  private final DataOutput output;
+
+  public GroupVIntWriter(DataOutput output) {
+    this.output = output;
+  }
+
+  public void add(int v) throws IOException {
+    buffer = ArrayUtil.grow(buffer, bufferOffset + 1);
+    buffer[bufferOffset++] = v;
+  }
+
+  public void reset(int numValues) {
+    buffer = ArrayUtil.grow(buffer, numValues);
+    bufferOffset = 0;
+    byteOffset = 0;
+  }
+
+  public void flush() throws IOException {
+    if (bufferOffset == 0) {
+      return;
+    }
+    encodeValues(buffer, bufferOffset);
+  }

Review Comment:
   What about making the API look more like the reader API, ie. replace 
reset/add/flush with a single `writeValues(DataOutput, long[] values, int 
limit)` API?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Encode integers using group-varint. It uses VInt to encode tail values that 
are not enough for a
+ * group
+ */
+public class GroupVIntWriter {
+  private int[] buffer = new int[4];
+  private byte[] bytes = new byte[16];
+  private int byteOffset = 0;
+  private int bufferOffset = 0;
+  private final DataOutput output;
+
+  public GroupVIntWriter(DataOutput output) {
+    this.output = output;
+  }
+
+  public void add(int v) throws IOException {
+    buffer = ArrayUtil.grow(buffer, bufferOffset + 1);
+    buffer[bufferOffset++] = v;
+  }
+
+  public void reset(int numValues) {
+    buffer = ArrayUtil.grow(buffer, numValues);
+    bufferOffset = 0;
+    byteOffset = 0;
+  }
+
+  public void flush() throws IOException {
+    if (bufferOffset == 0) {
+      return;
+    }
+    encodeValues(buffer, bufferOffset);
+  }
+
+  private int encodeValue(int v) {
+    int lastOff = byteOffset;
+    while ((v & ~0xFF) != 0) {
+      bytes[byteOffset++] = (byte) (v & 0xFF);
+      v >>>= 8;
+    }
+    bytes[byteOffset++] = (byte) v;
+    return byteOffset - lastOff;
+  }
+
+  private void encodeValues(int[] values, int limit) throws IOException {
+    int off = 0;
+
+    // encode each group
+    while ((limit - off) >= 4) {
+      // the maximum size of one group is 4 ints + 1 byte flag.
+      bytes = ArrayUtil.grow(bytes, byteOffset + 17);

Review Comment:
   could we write the data to `out` here instead of growing the buffer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

Reply via email to