mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415505550
########## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ########## @@ -153,6 +180,40 @@ private FSTCompiler( } } + // Get the respective FSTReader of the DataOutput. If the DataOutput is also a FSTReader then we + // will use it, otherwise we will return a NullFSTReader. Attempting to read from a FST with + // NullFSTReader + // will throw UnsupportedOperationException + private FSTReader toFSTReader(DataOutput dataOutput) { + if (dataOutput instanceof FSTReader) { + return (FSTReader) dataOutput; + } + return NULL_FST_READER; + } + + /** + * This class is used for FST backed by non-FSTReader DataOutput. It does not allow getting the + * reverse BytesReader nor writing to a DataOutput. + */ + private static final class NullFSTReader implements FSTReader { + + @Override + public long ramBytesUsed() { + return 0; + } + + @Override + public FST.BytesReader getReverseBytesReader() { + throw new UnsupportedOperationException( + "NullFSTReader does not support getReverseBytesReader()"); + } + + @Override + public void writeTo(DataOutput out) { + throw new UnsupportedOperationException("NullFSTReader does not support writeTo(DataOutput)"); Review Comment: Could we improve this message, e.g. something like `FST was not constructed with getOnHeapReaderWriter` or so? `NullFSTReader` isn't something the user should even know about (it's a private class implementation detail). ########## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ########## @@ -218,13 +279,19 @@ public Builder<T> allowFixedLengthArcs(boolean allowFixedLengthArcs) { } /** - * How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be - * large then make this larger. For example 15 bits = 32768 byte pages. + * Set the {@link DataOutput} which is used for low-level writing of FST. If you want the FST to + * be immediately readable, you need to use a DataOutput that also implements {@link FSTReader}, Review Comment: Does `FSTReader` need to be public and known to users? Could we make it package private and change this to say "you need to use `FSTCompiler#getOnHeapReaderWriter`"? ########## lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java: ########## @@ -316,6 +313,15 @@ public FST<T> doTest() throws IOException { return fst; } + protected FST<T> compile(FSTCompiler<T> fstCompiler) throws IOException { + return fstCompiler.compile(); + } + + protected FSTCompiler.Builder<T> getFSTBuilder() { + return new FSTCompiler.Builder<>( Review Comment: Could we randomize whether the "on disk" vs "on heap" `DataOutput` is used, so that `TestFSTs` randomly picks? We should randomize all three posibilities: * Create & use on-heap FST * Create on heap, save to disk, load FST from disk * Stream to disk, then load FST from disk ########## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ########## @@ -165,7 +226,7 @@ public static class Builder<T> { private final Outputs<T> outputs; private double suffixRAMLimitMB = 32.0; private boolean allowFixedLengthArcs = true; - private int bytesPageBits = 15; + private DataOutput dataOutput = getOnHeapReaderWriter(15); Review Comment: Hmm can we avoid even creating this, unless the caller didn't pass a `Builder.dataOutput` themselves? ########## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ########## @@ -120,22 +125,44 @@ public class FSTCompiler<T> { final float directAddressingMaxOversizingFactor; long directAddressingExpansionCredit; - final BytesStore bytes; + // the DataOutput to stream the FST bytes to + final DataOutput dataOutput; + + // buffer to store bytes for the one node we are currently writing + final GrowableByteArrayDataOutput scratchBytes = new GrowableByteArrayDataOutput(); + + private long numBytesWritten; + + /** + * Get an on-heap DataOutput that allows the FST to be read immediately after writing. Review Comment: Add `, and also optionally saved to an external DataOutput`? I.e. with these changes we support these possible FST workflows: * Build FST and use it immediately entirely in RAM and then discard it * Build FST and use it immediately entirely in RAM and also save it to disk (any other `DataOutput`), and load it later and use it * Build FST but stream it immediately to disk (except the `FSTMetaData`, saved at the end) and you cannot use it when done unless you go and open your own `DataInput` on the backing store and make a new FST from that Could we include this enumeration in `FSTCompiler''s class javadocs? Also: could you update `Test2BFSTs` to also test the 3rd bullet above? Right now it only tests the first 2 bullets. Actually, maybe create a new test, `Test2BFSTsToDisk` or so? That way we can limit heap size of that new test and confirm RAM is truly capped. That last bullet fully caps the RAM usage during compilation, yay! ########## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ########## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. It allows the FST + * to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { Review Comment: Could we get rid of `Freezable` and instead just cast to `ReadWriteDataOutput` and call its `.freeze()` method? We have only this concrete, package private class implementing this interface? ########## lucene/core/src/test/org/apache/lucene/util/fst/TestFSTDataOutputWriter.java: ########## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import static org.apache.lucene.tests.util.fst.FSTTester.toIntsRef; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Random; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.InputStreamDataInput; +import org.apache.lucene.store.OutputStreamDataOutput; +import org.apache.lucene.tests.store.MockDirectoryWrapper; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.TestUtil; +import org.apache.lucene.tests.util.fst.FSTTester; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.IntsRef; + +public class TestFSTDataOutputWriter extends LuceneTestCase { + + private MockDirectoryWrapper dir; + + @Override + public void setUp() throws Exception { + super.setUp(); + dir = newMockDirectory(); + } + + @Override + public void tearDown() throws Exception { + // can be null if we force simpletext (funky, some kind of bug in test runner maybe) + if (dir != null) { + dir.close(); + } + super.tearDown(); + } + + public void testRandom() throws Exception { + + final int iters = atLeast(10); + final int maxBytes = TEST_NIGHTLY ? 200000 : 20000; + for (int iter = 0; iter < iters; iter++) { + final int numBytes = TestUtil.nextInt(random(), 1, maxBytes); + final byte[] expected = new byte[numBytes]; + final ByteArrayOutputStream baos = new ByteArrayOutputStream(); + final DataOutput dataOutput = new OutputStreamDataOutput(baos); Review Comment: Hmm what is this actually testing? It looks like it's testing `java.io.ByteArrayOutputStream` and our `OutputStreamDataOutput` wrapper on top? It's not about FST storage at all? Confused... Maybe move it to `OutputStreamDataOutput` test class? ########## lucene/core/src/java/org/apache/lucene/util/fst/Freezable.java: ########## @@ -0,0 +1,24 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +/** Represent a datastructure that can be frozen (i.e., no longer modified). */ +interface Freezable { Review Comment: Note that `FieldType` also implements this interface (it's own version). But let's not declare that! Keep this package private. ########## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ########## @@ -500,6 +502,12 @@ public FSTMetadata<T> getMetadata() { return metadata; } + /** + * Save the FST to DataOutput. + * + * @param metaOut the DataOutput to write the metadata to + * @param out the DataOutput to write the FST bytes to + */ public void save(DataOutput metaOut, DataOutput out) throws IOException { Review Comment: One need not call this if they passed their own `IndexOutput` when creating the compiler? ########## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ########## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. It allows the FST + * to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { + + private final ByteBuffersDataOutput dataOutput; + // the DataInput to read from in case the DataOutput has multiple blocks + private ByteBuffersDataInput dataInput; + // the ByteBuffers to read from in case the DataOutput has a single block + private ByteBuffer byteBuffer; + + public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { + this.dataOutput = dataOutput; + } + + @Override + public void writeByte(byte b) { + dataOutput.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) { + dataOutput.writeBytes(b, offset, length); Review Comment: Can we throw an exception in all of these `write` methods if `.freeze()` was called already? Let's act frozen! ########## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ########## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. It allows the FST + * to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { + + private final ByteBuffersDataOutput dataOutput; + // the DataInput to read from in case the DataOutput has multiple blocks + private ByteBuffersDataInput dataInput; + // the ByteBuffers to read from in case the DataOutput has a single block + private ByteBuffer byteBuffer; + + public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { + this.dataOutput = dataOutput; + } + + @Override + public void writeByte(byte b) { + dataOutput.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) { + dataOutput.writeBytes(b, offset, length); + } + + @Override + public long ramBytesUsed() { + return dataOutput.ramBytesUsed(); + } + + @Override + public void freeze() { + // these operations are costly, so we want to compute it once and cache + List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList(); + if (byteBuffers.size() == 1) { Review Comment: Hmm can we just call `.toDataInput()`? I think it's already doing this "is it a single buffer" opto? Or was there even poor performance in that case? ########## lucene/core/src/java/org/apache/lucene/util/fst/Freezable.java: ########## @@ -0,0 +1,24 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +/** Represent a datastructure that can be frozen (i.e., no longer modified). */ Review Comment: `Represent` -> `Represents` ########## lucene/core/src/java/org/apache/lucene/util/fst/GrowableByteArrayDataOutput.java: ########## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.util.Accountable; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.RamUsageEstimator; + +// Storing a single contiguous byte[] for the current node of the FST we are writing. The byte[] +// will only grow, never +// shrink. Review Comment: Can you further strengthen the comment, something like `This is only safe for usage that is bounded in the number of bytes written. Do not make this public! Public users should instead use ByteBuffersDataOutput` or so? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org