mikemccand commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391064549


##########
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##########
@@ -359,7 +383,9 @@ public void truncate(long newLen) {
     assert newLen == getPosition();
   }
 
-  public void finish() {
+  @Override
+  public void freeze() {
+    this.frozen = true;

Review Comment:
   Should we `assert frozen == false` before setting it to `true`?  Is caller 
allowed to `freeze()` more than once?
   
   And maybe remove `this.` prefix?



##########
lucene/core/src/java/org/apache/lucene/util/fst/Freezeable.java:
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+/** Represent a datastructure that can be frozen and become immutable. */
+public interface Freezeable {

Review Comment:
   Can this be package private?
   
   Also `Freezeable -> Freezable`, tricky.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -120,31 +122,54 @@ public class FSTCompiler<T> {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter
+  final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS);

Review Comment:
   Rename to `scratchBytes`, and maybe say in the comment `// buffer to store 
bytes for the one node we are currently writing`?
   
   Also, instead of `BytesStore`, could we use Lucene's existing `oal.store` 
`ByteBuffersDataOutput.newResettableInstance()`?  Let's try to get away from 
FST implementing so many of its own storage classes/interfaces...



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTDataOutputWriter.java:
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * Allow the FST to be written to a {@link DataOutput}. Generally it only 
supports writing to the
+ * {@link DataOutput} and not reading from it. To read, you must either: 1. 
construct a
+ * corresponding {@link org.apache.lucene.store.DataInput} and use the {@link 
FSTStore} to read pr
+ * 2. use a DataOutput which also implements {@link FSTReader}
+ */
+final class FSTDataOutputWriter implements Freezeable {
+
+  private static final long BASE_RAM_BYTES_USED =
+      RamUsageEstimator.shallowSizeOfInstance(FSTDataOutputWriter.class);
+
+  /** the main DataOutput to store the FST bytes */
+  private final DataOutput dataOutput;
+
+  private long size = 0L;

Review Comment:
   You don't need the `= 0L` -- it's java's default already.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -120,31 +122,54 @@ public class FSTCompiler<T> {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter
+  final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS);
 
   /**
    * Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
    * tuning and tweaking, see {@link Builder}.
    */
   // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs<T> outputs) {
-    this(inputType, 32.0, outputs, true, 15, 1f);
+    this(
+        inputType,
+        32.0,
+        outputs,
+        true,
+        new FSTDataOutputWriter(getLegacyDataOutput(DEFAULT_BLOCK_BITS)),
+        1f);
+  }
+
+  static DataOutput getLegacyDataOutput(int blockBits) {
+    return new BytesStore(blockBits);
   }
 
   private FSTCompiler(
       FST.INPUT_TYPE inputType,
       double suffixRAMLimitMB,
       Outputs<T> outputs,
       boolean allowFixedLengthArcs,
-      int bytesPageBits,
+      FSTDataOutputWriter fstWriter,
       float directAddressingMaxOversizingFactor) {
     this.allowFixedLengthArcs = allowFixedLengthArcs;
     this.directAddressingMaxOversizingFactor = 
directAddressingMaxOversizingFactor;
-    bytes = new BytesStore(bytesPageBits);
     // pad: ensure no node gets address 0 which is reserved to mean
     // the stop state w/ no arcs
-    bytes.writeByte((byte) 0);
-    fst = new FST<>(new FST.FSTMetadata<>(inputType, null, -1, 
VERSION_CURRENT, 0), outputs, bytes);
+    try {
+      fstWriter.writeByte((byte) 0);
+    } catch (IOException e) {

Review Comment:
   Maybe we add `throws IOException` to this ctor?



##########
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##########
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   Could we remove this class entirely?  And callers that want to write FST and 
immediately use it in RAM should just use `ByteBuffersDataOutput` for their 
scratch area?
   
   Do we really need to publish a `Freezable` interface?  Can't it just be a 
`private boolean frozen` in this class?  Is anything needing freezing besides 
this `BytesStore`?
   
   And then `FSTCompiler` should only get a `DataOutput`?  Hmm this may require 
implementing a `ByteBuffersRandomAccessInput` or so, so the reverse 
(positional) byte reads work.



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -286,9 +331,9 @@ public long getMappedStateCount() {
     return dedupHash == null ? 0 : nodeCount;
   }
 
-  private CompiledNode compileNode(UnCompiledNode<T> nodeIn, int tailLength) 
throws IOException {
+  private CompiledNode compileNode(UnCompiledNode<T> nodeIn) throws 
IOException {
     final long node;
-    long bytesPosStart = bytes.getPosition();
+    long bytesPosStart = fstReader.size();

Review Comment:
   Can't we just record our current output position and use that as the address?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -120,31 +122,54 @@ public class FSTCompiler<T> {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;

Review Comment:
   I'm confused -- we don't ever need to read frozen nodes anymore, now that 
`NodeHash` takes a complete copy of the nodes it wants to read, and it decodes 
the node itself?  It seems like we (mostly?) just use this `fstReader`'s 
`.size()` method to know the address of the next written node/arc?  Can't we 
just track that ourselves (`numBytesWritten` or so)?



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -38,9 +38,6 @@
 import org.apache.lucene.util.IntsRefBuilder;
 import org.apache.lucene.util.fst.FST.INPUT_TYPE; // javadoc
 
-// TODO: could we somehow stream an FST to disk while we

Review Comment:
   Yay :)



##########
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##########
@@ -85,6 +84,9 @@ public class FSTCompiler<T> {
    */
   private static final float DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR 
= 1.66f;
 
+  // The number of bits for scratch area when adding a node
+  private static final int DEFAULT_SCRATCH_PAGE_BITS = 8;

Review Comment:
   Remove `DEFAULT_` prefix?  This is not controllable by the user right?  And 
move the constant to right before we create the scratch bytes?



##########
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##########
@@ -337,11 +349,23 @@ public long size() {
     return getPosition();
   }
 
+  /** Similar to {@link #truncate(long)} with newLen=0 but keep the first 
block to reduce GC. */
+  public void reset() {

Review Comment:
   If we can switch to Lucene's existing `ByteBuffersDataOutput` for our 
scratch single-node bytes then we can remove this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to