mikemccand commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1400539919


##########
lucene/core/src/java/org/apache/lucene/util/fst/ByteBuffersFSTReader.java:
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/** An adapter class to use {@link ByteBuffersDataOutput} as a {@link 
FSTReader} */
+final class ByteBuffersFSTReader extends DataOutput implements FSTReader {

Review Comment:
   It's kinda of jarring that this class is named `XFSTReader`, implements 
`FSTReader`, yet extends `DataOutput`?  Is it a reader or a writer?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##########
@@ -123,10 +124,7 @@ public enum INPUT_TYPE {
   /** If arc has this label then that arc is final/accepted */
   public static final int END_LABEL = -1;
 
-  /**
-   * A {@link BytesStore}, used during building, or during reading when the 
FST is very large (more
-   * than 1 GB). If the FST is less than 1 GB then bytesArray is set instead.
-   */
+  /** The reader of the FST */

Review Comment:
   Maybe `Used to read bytes from the underlying FST storage` or so?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##########
@@ -435,6 +433,13 @@ public FST(FSTMetadata<T> metadata, DataInput in, 
Outputs<T> outputs, FSTStore f
     this.fstReader = fstReader;
   }
 
+  /**
+   * @return true if and only if this FST is readable (i.e. has a reverse 
BytesReader)
+   */
+  public boolean hasReverseBytesReader() {

Review Comment:
   Is it only an FST being actively written by `FSTCompiler` that would not be 
readable?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -120,22 +120,45 @@ public class FSTCompiler<T> {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // the DataOutput to write the FST to

Review Comment:
   Maybe `to stream the FST bytes to`?  Emphasize the append-only-ness of it.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -441,23 +501,25 @@ long addNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws 
IOException {
       boolean continuousLabel = labelRange == nodeIn.numArcs;
       if (continuousLabel) {
         writeNodeForDirectAddressingOrContinuous(
-            nodeIn, startAddress, maxBytesPerArcWithoutLabel, labelRange, 
true);
+            nodeIn, maxBytesPerArcWithoutLabel, labelRange, true);
         continuousNodeCount++;
       } else if (shouldExpandNodeWithDirectAddressing(
           nodeIn, maxBytesPerArc, maxBytesPerArcWithoutLabel, labelRange)) {
         writeNodeForDirectAddressingOrContinuous(
-            nodeIn, startAddress, maxBytesPerArcWithoutLabel, labelRange, 
false);
+            nodeIn, maxBytesPerArcWithoutLabel, labelRange, false);
         directAddressingNodeCount++;
       } else {
-        writeNodeForBinarySearch(nodeIn, startAddress, maxBytesPerArc);
+        writeNodeForBinarySearch(nodeIn, maxBytesPerArc);
         binarySearchNodeCount++;
       }
     }
 
-    final long thisNodeAddress = bytes.getPosition() - 1;
-    bytes.reverse(startAddress, thisNodeAddress);
+    reverseScratchBytes();

Review Comment:
   Yay, thank you!  So much cleaner that caller (us, here) do this reversal 
than the underlying storage impl.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -568,18 +629,44 @@ private void writeNodeForBinarySearch(
                   + arcLen
                   + " nodeIn.numArcs="
                   + nodeIn.numArcs;
-          bytes.copyBytes(srcPos, destPos, arcLen);
+          // copy the bytes from srcPos to destPos, essentially expanding the 
arc from variable
+          // length to fixed length
+          writeScratchBytes(destPos, scratchBytes.getBytes(), srcPos, arcLen);
         }
       }
     }
 
-    // Write the header.
-    bytes.writeBytes(startAddress, fixedLengthArcsBuffer.getBytes(), 0, 
headerLen);
+    // Finally write the header
+    writeScratchBytes(0, fixedLengthArcsBuffer.getBytes(), 0, headerLen);
+  }
+
+  /** Reverse the scratch bytes */
+  private void reverseScratchBytes() {
+    int pos = scratchBytes.getPosition();
+    byte[] bytes = scratchBytes.getBytes();
+    int limit = pos / 2;
+    for (int i = 0; i < limit; i++) {
+      byte b = bytes[i];
+      bytes[i] = bytes[pos - 1 - i];
+      bytes[pos - 1 - i] = b;
+    }
+  }
+
+  /**
+   * Write bytes from a source byte[] to the scratch bytes

Review Comment:
   Maybe add that the written bytes must fit within what's already been written 
to `scratchBytes`?  I.e. you are only "rewriting" what was already written, 
never extending the `scratchBytes`?  And maybe also state that 
`scratchBytes.getPosition()` is unaffected by this.



##########
lucene/core/src/java/org/apache/lucene/util/fst/GrowableByteArrayDataOutput.java:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+// Storing a byte[] for the current node of the FST we are writing. The byte[] 
will only grow, never
+// shrink.
+final class GrowableByteArrayDataOutput extends DataOutput implements 
Accountable {
+
+  private static final long BASE_RAM_BYTES_USED =
+      
RamUsageEstimator.shallowSizeOfInstance(GrowableByteArrayDataOutput.class);
+
+  private static final int INITIAL_SIZE = 1 << 8;
+
+  // holds an initial size of 256 bytes. this byte array will only grow, but 
not shrink
+  private byte[] bytes = new byte[INITIAL_SIZE];
+
+  private int nextWrite;
+
+  @Override
+  public void writeByte(byte b) {
+    ensureCapacity(1);
+    bytes[nextWrite++] = b;
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int len) {
+    ensureCapacity(len);
+    System.arraycopy(b, offset, bytes, nextWrite, len);
+    nextWrite += len;
+  }
+
+  public int getPosition() {
+    return nextWrite;
+  }
+
+  public byte[] getBytes() {
+    return bytes;
+  }
+
+  /** Set the position of the byte[], increasing the capacity if needed */
+  public void setPosition(int newLen) {
+    assert newLen >= 0;
+    if (newLen > nextWrite) {
+      ensureCapacity(newLen - nextWrite);
+    }
+    nextWrite = newLen;

Review Comment:
   Hmm should we ever downsize the `byte[]`?  When building a big FST it will 
grow to hold biggest frozen node it's seen so far?  I think that will tend to 
be towards the very end of FST building?  Also, the max frozen FST node size is 
... not too big?  I suppose if you are using full Unicode alphabet it might be 
massive, hmm.  I don't think we need to shrink for now?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -218,13 +269,19 @@ public Builder<T> allowFixedLengthArcs(boolean 
allowFixedLengthArcs) {
     }
 
     /**
-     * How many bits wide to make each byte[] block in the BytesStore; if you 
know the FST will be
-     * large then make this larger. For example 15 bits = 32768 byte pages.
+     * Set the {@link DataOutput} which is used for low-level writing of FST. 
If you want the FST to
+     * be readable, you need to use a DataOutput that also implements {@link 
FSTReader}, such as

Review Comment:
   Maybe `to be immediately readable`?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -120,22 +120,45 @@ public class FSTCompiler<T> {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // the DataOutput to write the FST to
+  final DataOutput dataOutput;
+
+  // buffer to store bytes for the one node we are currently writing
+  final GrowableByteArrayDataOutput scratchBytes = new 
GrowableByteArrayDataOutput();
+
+  private long numBytesWritten;
+
+  /**
+   * Get an on-heap DataOutput that allows the FST to be read immediately 
after writing.
+   *
+   * @param blockBits how many bits wide to make each block of the DataOutput
+   * @return the DataOutput
+   */
+  public static DataOutput getOnHeapDataOutput(int blockBits) {

Review Comment:
   Nice.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -153,6 +176,34 @@ private FSTCompiler(
     }
   }
 
+  // Get the respective FSTReader of the DataOutput. If the DataOutput is also 
a FSTReader then we
+  // will use it. Otherwise, we will use NullFSTReader, which does not allow 
reading.
+  private FSTReader toFSTReader(DataOutput dataOutput) {
+    if (dataOutput instanceof FSTReader) {
+      return (FSTReader) dataOutput;
+    }
+    return new NullFSTReader();
+  }
+
+  private static final class NullFSTReader implements FSTReader {
+
+    @Override
+    public FST.BytesReader getReverseBytesReader() {
+      return null;
+    }
+
+    @Override
+    public void writeTo(DataOutput out) {
+      throw new UnsupportedOperationException("writeTo(DataOutput) is not 
supported");
+    }
+
+    @Override
+    public long ramBytesUsed() {
+      return 0;
+    }
+  }
+  ;

Review Comment:
   Hmm is this one of those deadly lurking semicolons?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -568,18 +629,44 @@ private void writeNodeForBinarySearch(
                   + arcLen
                   + " nodeIn.numArcs="
                   + nodeIn.numArcs;
-          bytes.copyBytes(srcPos, destPos, arcLen);
+          // copy the bytes from srcPos to destPos, essentially expanding the 
arc from variable
+          // length to fixed length
+          writeScratchBytes(destPos, scratchBytes.getBytes(), srcPos, arcLen);
         }
       }
     }
 
-    // Write the header.
-    bytes.writeBytes(startAddress, fixedLengthArcsBuffer.getBytes(), 0, 
headerLen);
+    // Finally write the header
+    writeScratchBytes(0, fixedLengthArcsBuffer.getBytes(), 0, headerLen);
+  }
+
+  /** Reverse the scratch bytes */

Review Comment:
   Maybe add `does not affect scratchBytes.getPosition()`.  And say "in place", 
e.g. `Reverse the scratch bytes in place`?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -153,6 +176,34 @@ private FSTCompiler(
     }
   }
 
+  // Get the respective FSTReader of the DataOutput. If the DataOutput is also 
a FSTReader then we
+  // will use it. Otherwise, we will use NullFSTReader, which does not allow 
reading.
+  private FSTReader toFSTReader(DataOutput dataOutput) {
+    if (dataOutput instanceof FSTReader) {
+      return (FSTReader) dataOutput;
+    }
+    return new NullFSTReader();

Review Comment:
   Can we throw an exception instead?  I'd rather not mask (from the caller) 
that what they are trying to do won't work.  Like if they try to read anything 
from the `NullFSTReader` it will fail?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -277,9 +336,9 @@ public long getMappedStateCount() {
     return dedupHash == null ? 0 : nodeCount;
   }
 
-  private CompiledNode compileNode(UnCompiledNode<T> nodeIn, int tailLength) 
throws IOException {
+  private CompiledNode compileNode(UnCompiledNode<T> nodeIn) throws 
IOException {

Review Comment:
   Ahh this `int tailLength` was unused now?



##########
lucene/core/src/java/org/apache/lucene/util/fst/GrowableByteArrayDataOutput.java:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+// Storing a byte[] for the current node of the FST we are writing. The byte[] 
will only grow, never

Review Comment:
   Maybe add `into a single contiguous byte[]`?  I.e. no blocks.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -827,22 +910,24 @@ void setEmptyOutput(T v) {
   }
 
   void finish(long newStartNode) {
-    assert newStartNode <= bytes.size();
+    assert newStartNode <= numBytesWritten;
     if (fst.metadata.startNode != -1) {
       throw new IllegalStateException("already finished");
     }
     if (newStartNode == FINAL_END_NODE && fst.metadata.emptyOutput != null) {
       newStartNode = 0;
     }
     fst.metadata.startNode = newStartNode;
-    fst.metadata.numBytes = bytes.getPosition();
+    fst.metadata.numBytes = numBytesWritten;
   }
 
   private boolean validOutput(T output) {
     return output == NO_OUTPUT || !output.equals(NO_OUTPUT);
   }
 
   /** Returns final FST. NOTE: this will return null if nothing is accepted by 
the FST. */
+  // TODO: make this method to only return the FSTMetadata and user needs to 
construct the FST
+  // themselves
   public FST<T> compile() throws IOException {

Review Comment:
   Ahh good point -- what can this returned `FST<T>` actually do?  It's invalid 
to try to use it if you backed your `DataOutput` by e.g. an `IndexOutput` 
writing to disk, unable to give a `DataInput` immediately?  What happens if you 
try?  This seems like important TODO but will touch so much code ... we can 
defer.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -248,15 +305,17 @@ public Builder<T> 
directAddressingMaxOversizingFactor(float factor) {
 
     /** Creates a new {@link FSTCompiler}. */
     public FSTCompiler<T> build() {
-      FSTCompiler<T> fstCompiler =
-          new FSTCompiler<>(
-              inputType,
-              suffixRAMLimitMB,
-              outputs,
-              allowFixedLengthArcs,
-              bytesPageBits,
-              directAddressingMaxOversizingFactor);
-      return fstCompiler;
+      try {
+        return new FSTCompiler<>(
+            inputType,
+            suffixRAMLimitMB,
+            outputs,
+            allowFixedLengthArcs,
+            dataOutput,
+            directAddressingMaxOversizingFactor);
+      } catch (IOException e) {
+        throw new RuntimeException(e);

Review Comment:
   Can we add `throws IOException` to the method signature?  It's important 
caller knows when writing to a storage device which has some problem that they 
may see a (legit) `IOException` now?



##########
lucene/core/src/java/org/apache/lucene/util/fst/GrowableByteArrayDataOutput.java:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+// Storing a byte[] for the current node of the FST we are writing. The byte[] 
will only grow, never
+// shrink.
+final class GrowableByteArrayDataOutput extends DataOutput implements 
Accountable {
+
+  private static final long BASE_RAM_BYTES_USED =
+      
RamUsageEstimator.shallowSizeOfInstance(GrowableByteArrayDataOutput.class);
+
+  private static final int INITIAL_SIZE = 1 << 8;
+
+  // holds an initial size of 256 bytes. this byte array will only grow, but 
not shrink
+  private byte[] bytes = new byte[INITIAL_SIZE];
+
+  private int nextWrite;
+
+  @Override
+  public void writeByte(byte b) {
+    ensureCapacity(1);
+    bytes[nextWrite++] = b;
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int len) {
+    ensureCapacity(len);
+    System.arraycopy(b, offset, bytes, nextWrite, len);
+    nextWrite += len;
+  }
+
+  public int getPosition() {
+    return nextWrite;
+  }
+
+  public byte[] getBytes() {
+    return bytes;
+  }
+
+  /** Set the position of the byte[], increasing the capacity if needed */
+  public void setPosition(int newLen) {
+    assert newLen >= 0;
+    if (newLen > nextWrite) {
+      ensureCapacity(newLen - nextWrite);
+    }
+    nextWrite = newLen;
+  }
+
+  /**
+   * Ensure we can write additional capacityToWrite bytes.
+   *
+   * @param capacityToWrite the additional bytes to write
+   */
+  private void ensureCapacity(int capacityToWrite) {
+    bytes = ArrayUtil.grow(bytes, nextWrite + capacityToWrite);

Review Comment:
   Maybe `assert capacityToWrite > 0`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to