[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
[ https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523347#comment-17523347 ]

Michael McCandless commented on LUCENE-10482:
---------------------------------------------

Thanks [~gworah] -- can we resolve this now?

> Allow users to create their own DirectoryTaxonomyReaders with empty
> taxoArrays instead of letting the taxoEpoch decide
> -------------------------------------------------------------------
>
>                 Key: LUCENE-10482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10482
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>    Affects Versions: 9.1
>            Reporter: Gautam Worah
>            Priority: Minor
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> I was experimenting with the taxonomy index and {{DirectoryTaxonomyReaders}}
> in my day job, where we were trying to replace the index underneath a reader
> asynchronously and then call {{doOpenIfChanged}} on it.
> It turns out that the taxonomy index uses its own index-based counter (the
> {{taxonomyIndexEpoch}}) to determine whether the index was opened in write
> mode after the last time it was written; if not, it directly reuses the
> {{taxoArrays}} it had previously created. This logic fails in a scenario
> where both the old and the new index were opened just once but the index
> itself is completely different in the two cases.
> In such a case, it would be good to give the user the flexibility to tell
> the DTR to recreate its {{taxoArrays}}, {{ordinalCache}} and
> {{categoryCache}} (not refreshing these caches causes it to fail in various
> ways). Luckily, such a constructor already exists! But it is private today!
> The idea here is to allow subclasses of DTR to use this constructor.
> Curious to see what other folks think about this idea.
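For readers skimming the thread, a minimal sketch of what such a subclass could look like once the private constructor is opened up. The five-argument super call (null caches force recreation) and the {{getInternalIndexReader()}} accessor are assumptions inferred from the issue description above, not the final committed API:

{code:java}
import java.io.IOException;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

// Hypothetical subclass: always rebuild taxoArrays/ordinalCache/categoryCache
// on refresh instead of letting the taxonomy epoch decide whether to reuse them.
class AlwaysRefreshDirectoryTaxonomyReader extends DirectoryTaxonomyReader {

  AlwaysRefreshDirectoryTaxonomyReader(Directory dir) throws IOException {
    super(dir);
  }

  AlwaysRefreshDirectoryTaxonomyReader(DirectoryReader reader) throws IOException {
    // Assumed constructor shape: passing null caches/taxoArrays makes the
    // reader recreate them from the new index instead of reusing old state.
    super(reader, null, null, null, null);
  }

  @Override
  protected DirectoryTaxonomyReader doOpenIfChanged() throws IOException {
    DirectoryReader r2 = DirectoryReader.openIfChanged(getInternalIndexReader());
    // null means nothing changed; otherwise wrap the fresh reader with
    // empty caches so queries never see stale taxonomy ordinals.
    return r2 == null ? null : new AlwaysRefreshDirectoryTaxonomyReader(r2);
  }
}
{code}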
[GitHub] [lucene] mikemccand commented on a diff in pull request #805: LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori
mikemccand commented on code in PR #805:
URL: https://github.com/apache/lucene/pull/805#discussion_r851776382


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java:
##

@@ -0,0 +1,1253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.EnumMap;
+import java.util.HashMap;
+import java.util.List;
+import org.apache.lucene.analysis.ja.dict.CharacterDefinition;
+import org.apache.lucene.analysis.ja.dict.JaMorphData;
+import org.apache.lucene.analysis.ja.dict.TokenInfoDictionary;
+import org.apache.lucene.analysis.ja.dict.UnknownDictionary;
+import org.apache.lucene.analysis.ja.dict.UserDictionary;
+import org.apache.lucene.analysis.morph.ConnectionCosts;
+import org.apache.lucene.analysis.morph.Dictionary;
+import org.apache.lucene.analysis.morph.GraphvizFormatter;
+import org.apache.lucene.analysis.morph.TokenInfoFST;
+import org.apache.lucene.analysis.morph.TokenType;
+import org.apache.lucene.analysis.morph.Viterbi;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.fst.FST;
+
+/**
+ * {@link org.apache.lucene.analysis.morph.Viterbi} subclass for Japanese morphological analysis.
+ * This also performs n-best path calculation.
+ */
+final class ViterbiNBest
+    extends org.apache.lucene.analysis.morph.Viterbi {
+
+  private final EnumMap> dictionaryMap =
+      new EnumMap<>(TokenType.class);
+
+  private final UnknownDictionary unkDictionary;
+  private final CharacterDefinition characterDefinition;
+  private final UserDictionary userDictionary;
+
+  private final boolean discardPunctuation;
+  private final boolean searchMode;
+  private final boolean extendedMode;
+  private final boolean outputCompounds;
+
+  // Allowable cost difference for N-best output:
+  private int nBestCost = 0;
+
+  Lattice lattice = null;
+
+  private GraphvizFormatter dotOut;
+
+  ViterbiNBest(
+      TokenInfoFST fst,
+      FST.BytesReader fstReader,
+      TokenInfoDictionary dictionary,
+      TokenInfoFST userFST,
+      FST.BytesReader userFSTReader,
+      UserDictionary userDictionary,
+      ConnectionCosts costs,
+      Class positionImpl,
+      UnknownDictionary unkDictionary,
+      CharacterDefinition characterDefinition,
+      boolean discardPunctuation,
+      boolean searchMode,
+      boolean extendedMode,
+      boolean outputCompounds) {
+    super(fst, fstReader, dictionary, userFST, userFSTReader, userDictionary, costs, positionImpl);
+    this.unkDictionary = unkDictionary;
+    this.characterDefinition = characterDefinition;
+    this.userDictionary = userDictionary;
+    this.discardPunctuation = discardPunctuation;
+    this.searchMode = searchMode;
+    this.extendedMode = extendedMode;
+    this.outputCompounds = outputCompounds;
+    dictionaryMap.put(TokenType.KNOWN, dictionary);
+    dictionaryMap.put(TokenType.UNKNOWN, unkDictionary);
+    dictionaryMap.put(TokenType.USER, userDictionary);
+  }
+
+  @Override
+  protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Position posData) {
+    return !searchMode && super.shouldSkipProcessUnknownWord(unknownWordEndIndex, posData);
+  }
+
+  private static final int SEARCH_MODE_KANJI_LENGTH = 2;
+  private static final int SEARCH_MODE_OTHER_LENGTH = 7; // Must be >= SEARCH_MODE_KANJI_LENGTH
+  private static final int SEARCH_MODE_KANJI_PENALTY = 3000;
+  private static final int SEARCH_MODE_OTHER_PENALTY = 1700;
+
+  @Override
+  protected int computePenalty(int pos, int length) throws IOException {
+    if (length > SEARCH_MODE_KANJI_LENGTH) {
+      boolean allKanji = true;
+      // check if node consists of only kanji
+      final int endPos = pos + length;
+      for (int pos2 = pos; pos2 < endPos; pos2++) {
+        if (!characterDefinition.isKanji((char) buffer.get(pos2))) {
+          allKanji = false;
+          break;
+        }
+      }
+      if (allKanji) { // Process only Kanji keywords
+        return (length - SEARCH_MODE_KANJI_LENGTH) * SEARCH_MODE_KANJI_PENALTY;
+      } else if (length > SEARCH_MODE_OTHER_LENGT
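For context on the search-mode penalty in the hunk above, here is a small standalone sketch (not part of the PR) of the all-kanji branch's arithmetic:

```java
// Standalone sketch mirroring the all-kanji branch of computePenalty above.
static int kanjiPenalty(int length) {
  final int SEARCH_MODE_KANJI_LENGTH = 2;
  final int SEARCH_MODE_KANJI_PENALTY = 3000;
  return length > SEARCH_MODE_KANJI_LENGTH
      ? (length - SEARCH_MODE_KANJI_LENGTH) * SEARCH_MODE_KANJI_PENALTY
      : 0;
}
// kanjiPenalty(4) == (4 - 2) * 3000 == 6000: a 4-kanji candidate token pays a
// 6000-point path-cost penalty, so in search mode the Viterbi search tends to
// split long kanji compounds into shorter tokens rather than keep them whole.
```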
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
mayya-sharipova commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r851785737


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsWriter.java:
##

@@ -0,0 +1,328 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import static org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.RandomAccessVectorValuesProducer;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator;
+import org.apache.lucene.util.hnsw.HnswGraphBuilder;
+import org.apache.lucene.util.hnsw.NeighborArray;
+import org.apache.lucene.util.hnsw.OnHeapHnswGraph;
+import org.apache.lucene.util.packed.DirectMonotonicWriter;
+
+/**
+ * Writes vector values and knn graphs to index segments.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene92HnswVectorsWriter extends KnnVectorsWriter {
+
+  private final SegmentWriteState segmentWriteState;
+  private final IndexOutput meta, vectorData, vectorIndex;
+  private final int maxDoc;
+
+  private final int maxConn;
+  private final int beamWidth;
+  private boolean finished;
+
+  Lucene92HnswVectorsWriter(SegmentWriteState state, int maxConn, int beamWidth)
+      throws IOException {
+    this.maxConn = maxConn;
+    this.beamWidth = beamWidth;
+
+    assert state.fieldInfos.hasVectorValues();
+    segmentWriteState = state;
+
+    String metaFileName =
+        IndexFileNames.segmentFileName(
+            state.segmentInfo.name, state.segmentSuffix, Lucene92HnswVectorsFormat.META_EXTENSION);
+
+    String vectorDataFileName =
+        IndexFileNames.segmentFileName(
+            state.segmentInfo.name,
+            state.segmentSuffix,
+            Lucene92HnswVectorsFormat.VECTOR_DATA_EXTENSION);
+
+    String indexDataFileName =
+        IndexFileNames.segmentFileName(
+            state.segmentInfo.name,
+            state.segmentSuffix,
+            Lucene92HnswVectorsFormat.VECTOR_INDEX_EXTENSION);
+
+    boolean success = false;
+    try {
+      meta = state.directory.createOutput(metaFileName, state.context);
+      vectorData = state.directory.createOutput(vectorDataFileName, state.context);
+      vectorIndex = state.directory.createOutput(indexDataFileName, state.context);
+
+      CodecUtil.writeIndexHeader(
+          meta,
+          Lucene92HnswVectorsFormat.META_CODEC_NAME,
+          Lucene92HnswVectorsFormat.VERSION_CURRENT,
+          state.segmentInfo.getId(),
+          state.segmentSuffix);
+      CodecUtil.writeIndexHeader(
+          vectorData,
+          Lucene92HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
+          Lucene92HnswVectorsFormat.VERSION_CURRENT,
+          state.segmentInfo.getId(),
+          state.segmentSuffix);
+      CodecUtil.writeIndexHeader(
+          vectorIndex,
+          Lucene92HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+          Lucene92HnswVectorsFormat.VERSION_CURRENT,
+          state.segmentInfo.getId(),
+          state.segmentSuffix);
+      maxDoc = state.segmentInfo.maxDoc();
+      success = true;
+    } finally {
+      if (success == false) {
+        IOUtils.closeWhileHandlingException(this);
+      }
+    }
+  }
+
+  @Override
+  public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+      throws IOException {
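A side note on the constructor above: it uses Lucene's common "success flag" idiom so that partially created outputs are closed if any of the createOutput/writeIndexHeader calls throws. A minimal standalone sketch of the same idiom, with hypothetical file names:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.IOUtils;

// Hypothetical helper showing the success-flag idiom in isolation.
static IndexOutput[] openOutputs(Directory dir) throws IOException {
  IndexOutput a = null, b = null;
  boolean success = false;
  try {
    a = dir.createOutput("a.dat", IOContext.DEFAULT);
    b = dir.createOutput("b.dat", IOContext.DEFAULT);
    success = true;
    return new IndexOutput[] {a, b};
  } finally {
    if (success == false) {
      // Close whatever was opened, suppressing secondary exceptions so the
      // original failure propagates to the caller.
      IOUtils.closeWhileHandlingException(a, b);
    }
  }
}
```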
[GitHub] [lucene] mayya-sharipova commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
mayya-sharipova commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1100923586

@LuXugang as per @jtibshirani's [comment](https://github.com/apache/lucene/pull/792#pullrequestreview-940285463), the main factor in deciding whether we can use `DirectMonotonicWriter/Reader` is performance. We first need to see performance results to confirm that searches are not slowed down.
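For context, a minimal sketch of how an ord-to-doc mapping could be written with `DirectMonotonicWriter`. The meta/data outputs and the block shift here are illustrative assumptions, not the PR's actual file layout:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

// Write a non-decreasing ordToDoc mapping (ord -> docId) in compressed,
// random-access form. metaOut/dataOut and the block shift are illustrative.
static void writeOrdToDoc(IndexOutput metaOut, IndexOutput dataOut, int[] ordToDoc)
    throws IOException {
  final int BLOCK_SHIFT = 16; // assumed: blocks of 2^16 values
  DirectMonotonicWriter writer =
      DirectMonotonicWriter.getInstance(metaOut, dataOut, ordToDoc.length, BLOCK_SHIFT);
  for (int doc : ordToDoc) {
    writer.add(doc); // doc ids must be added in non-decreasing order
  }
  writer.finish();
}
```

Whether the extra indirection at search time is acceptable is exactly the open question above, hence the request for benchmark results.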
[GitHub] [lucene] uschindler commented on pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
uschindler commented on PR #762:
URL: https://github.com/apache/lucene/pull/762#issuecomment-1100941707

Hi, since this commit all Jenkins Windows builds fail with this error:

```
FAILED: org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader

Error Message:
java.nio.file.InvalidPathException: Illegal char <:> at index 13: 2022-04-15T20:35:33.995886500Z-001

Stack Trace:
java.nio.file.InvalidPathException: Illegal char <:> at index 13: 2022-04-15T20:35:33.995886500Z-001
    at __randomizedtesting.SeedInfo.seed([D59CA6A5F246F40:34E0EFAFAAB6C033]:0)
    at java.base/sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
    at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
    at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
    at java.base/sun.nio.fs.WindowsPath.parse(WindowsPath.java:92)
    at java.base/sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:232)
    at java.base/java.nio.file.Path.resolve(Path.java:515)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleTemporaryFilesCleanup.createTempDir(TestRuleTemporaryFilesCleanup.java:284)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.LuceneTestCase.createTempDir(LuceneTestCase.java:3076)
    at org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader(TestAlwaysRefreshDirectoryTaxonomyReader.java:56)
    at org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader(TestAlwaysRefreshDirectoryTaxonomyReader.java:128)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
    at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
    at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
    at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
    at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.
```
[GitHub] [lucene] uschindler commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch deci
uschindler commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r851798738


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##

@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.sleep;
+import static org.apache.lucene.tests.mockfile.ExtrasFS.isExtra;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.IOUtils;
+
+// Nefarious FS will delay/stop deletion of index files and this test specifically does that
+@LuceneTestCase.SuppressFileSystems({"WindowsFS", "VirusCheckingFS"})
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should be
+   * self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * It does not check whether the private taxoArrays were actually recreated or not. We are
+   * (correctly) hiding that complexity away from the user.
+   */
+  private void testAlwaysRefreshDirectoryTaxonomyReader(
+      Function dtrProducer, Class exceptionType)
+      throws IOException {
+    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

Review Comment:
   This breaks on Windows, because ":" is not a valid character for file names.
[GitHub] [lucene] uschindler commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch deci
uschindler commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r851798902


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##

+    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

Review Comment:
   Why do we need the actual time at all? Every test has its own directory, so there's no need to have time in the file name. createTempDir makes sure to have a unique file name anyway!
[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
[ https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523439#comment-17523439 ]

Uwe Schindler commented on LUCENE-10482:
----------------------------------------

Hey, the commit breaks all builds on Windows because the test file name contains a ":" (it looks like it comes from the timestamp in the file name). Please fix or revert!

Why do we need Instant.now() in the file name at all? createTempDir creates a unique test-dependent directory anyway. You just need to pass the test's name and you're done. Uniqueness is ensured by the test framework.

The problem with Instant.now() is also non-reproducibility!
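To illustrate the failure mode: {{Instant.now().toString()}} yields an ISO-8601 timestamp such as {{2022-04-15T20:35:33.995886500Z}}, and the {{:}} characters in it are illegal in Windows file names. A minimal sketch of the problem (the directory and method names below are illustrative, not the test's actual code):

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;

public class WindowsPathDemo {
  public static void main(String[] args) {
    String name = String.valueOf(Instant.now()); // e.g. "2022-04-15T20:35:33.995886500Z"
    // On Windows the next line throws
    // java.nio.file.InvalidPathException: Illegal char <:> at index 13
    Path p = Paths.get("tmp").resolve(name);
    System.out.println(p);
    // Safe alternative in a Lucene test: pass a plain descriptive name and let
    // the framework pick a unique directory, e.g.
    //   Path taxoPath = createTempDir("TestAlwaysRefreshDirectoryTaxonomyReader");
  }
}
{code}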
[GitHub] [lucene] dweiss commented on pull request #811: Add some basic tasks to help/workflow
dweiss commented on PR #811:
URL: https://github.com/apache/lucene/pull/811#issuecomment-1100945965

Please go ahead, @mocobeta. I'm on Easter holidays, with intermittent access to the internet.
[jira] [Commented] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523449#comment-17523449 ]

Mayya Sharipova commented on LUCENE-10518:
------------------------------------------

[~dnhatn] Thanks for a great investigation! I agree with the idea of backporting the two-phase fields processing from Lucene 9 to Lucene 8, since Lucene 8 already doesn't allow a field to have, for example, different types of doc values in different documents.

About the second idea, to introduce an option to skip checking field-infos consistency: I am not sure about this. We've introduced this consistency check so that we can rely on it for different functionalities (for example, sort optimization is based on the fact that a field is indexed with doc values and points across all documents that have this field; FieldExistsQuery also relies on it; see the [original email|https://lists.apache.org/thread/2jkdoj237zho8h77l8rgkn7nlklo9pct]). So if we break this consistency check, we can't use these functionalities anymore.

You also suggested that a workaround for this issue is to do a force merge on 8.x Lucene. So maybe this workaround plus backporting the two-phase fields processing to 8.x would be enough?

> FieldInfos consistency check can refuse to open Lucene 8 index
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10518
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10518
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 8.10.1
>            Reporter: Nhat Nguyen
>            Priority: Major
>
> A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can
> refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo if
> it hits a non-aborting exception (for example, [term is too
> long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944])
> while processing the fields of a document. We don't have this problem in
> Lucene 9, as we process fields in two phases, with the [first
> phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614]
> processing only FieldInfos.
> The issue can be reproduced with this snippet.
> {code:java}
> public void testWriteIndexOn8x() throws Exception {
>   FieldType KeywordField = new FieldType();
>   KeywordField.setTokenized(false);
>   KeywordField.setOmitNorms(true);
>   KeywordField.setIndexOptions(IndexOptions.DOCS);
>   KeywordField.freeze();
>   try (Directory dir = newDirectory()) {
>     IndexWriterConfig config = new IndexWriterConfig();
>     config.setCommitOnClose(false);
>     config.setMergePolicy(NoMergePolicy.INSTANCE);
>     try (IndexWriter writer = new IndexWriter(dir, config)) {
>       // first segment
>       writer.addDocument(new Document()); // an empty doc
>       Document d1 = new Document();
>       byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
>       Arrays.fill(chars, (byte) 'a');
>       d1.add(new Field("field", new BytesRef(chars), KeywordField));
>       d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
>       expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
>       writer.flush();
>       // second segment
>       Document d2 = new Document();
>       d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
>       d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
>       writer.addDocument(d2);
>       writer.flush();
>       writer.commit();
>       // Check for doc values types consistency
>       Map<String, DocValuesType> docValuesTypes = new HashMap<>();
>       try (DirectoryReader reader = DirectoryReader.open(dir)) {
>         for (LeafReaderContext leaf : reader.leaves()) {
>           for (FieldInfo fi : leaf.reader().getFieldInfos()) {
>             DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
>             if (current != null && current != fi.getDocValuesType()) {
>               fail("cannot change DocValues type from " + current + " to "
>                   + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
>             }
>           }
>         }
>       }
>     }
>   }
> }
> {code}
> I would like to propose to:
> - Backport the two-phase fields processing from Lucene 9 to Lucene 8. The
>   patch should be small and contained.
> - Introduce an option in Lucene 9 to skip checking field-infos consistency
>   (i.e., behave like Lucene 8 when the option is enabled).
> /cc [~mayya] and [~jpountz]
[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
[ https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523464#comment-17523464 ]

Gautam Worah commented on LUCENE-10482:
---------------------------------------

Got it. My apologies for introducing this error. I had switched to using Instant.now() during testing because when the single test case failed midway, it was not able to delete the temporary folder, and subsequent runs of the test would then fail because a folder with that name already existed. Submitting a fix right now.
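A sketch of what the fix might look like. The exact prefixes passed to {{createTempDir}} are illustrative; the test framework guarantees a unique directory per call, so no timestamp is needed:

{code:java}
// Before (breaks on Windows: ISO timestamps contain ':'):
// final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

// After: plain descriptive prefixes; createTempDir appends a unique suffix
// and the test framework cleans the directories up between runs.
final Path taxoPath1 = createTempDir("testAlwaysRefreshDTR-1");
final Path taxoPath2 = createTempDir("testAlwaysRefreshDTR-2");
{code}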
[GitHub] [lucene] gautamworah96 commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch d
gautamworah96 commented on code in PR #762: URL: https://github.com/apache/lucene/pull/762#discussion_r851814501

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:

```diff
@@ -0,0 +1,194 @@
+package org.apache.lucene.facet.taxonomy.directory;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.sleep;
+import static org.apache.lucene.tests.mockfile.ExtrasFS.isExtra;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.IOUtils;
+
+// Nefarious FS will delay/stop deletion of index files and this test specifically does that
+@LuceneTestCase.SuppressFileSystems({"WindowsFS", "VirusCheckingFS"})
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by checking whether the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should
+   * be self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * It does not check whether the private taxoArrays were actually recreated or not. We are
+   * (correctly) hiding that complexity away from the user.
+   */
+  private void testAlwaysRefreshDirectoryTaxonomyReader(
+      Function<Directory, DirectoryTaxonomyReader> dtrProducer,
+      Class<? extends Throwable> exceptionType)
+      throws IOException {
+    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));
```

Review Comment: Apologies for this. I've replied on the JIRA thread as well. I had switched to using Instant.now() during testing because, when the single test case failed midway, it could not delete the temporary folder; subsequent runs of the test would then fail because a folder with that name already existed. Non-reproducibility is also an issue, +1. Submitting a fix right now.
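For readers skimming the thread, the class under test is a small DirectoryTaxonomyReader subclass that always rebuilds its caches on reopen. Below is a minimal sketch of that idea, assuming the five-argument constructor this PR makes available to subclasses and same-package access to the internal DirectoryReader; the names mirror the test above, but treat this as an approximation rather than the committed code:

```java
package org.apache.lucene.facet.taxonomy.directory;

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.IOUtils;

public class AlwaysRefreshDirectoryTaxonomyReader extends DirectoryTaxonomyReader {

  AlwaysRefreshDirectoryTaxonomyReader(Directory directory) throws IOException {
    super(directory);
  }

  AlwaysRefreshDirectoryTaxonomyReader(DirectoryReader indexReader) throws IOException {
    // Passing nulls forces taxoArrays, ordinalCache and categoryCache to be
    // recreated lazily instead of carried over from the old reader.
    super(indexReader, null, null, null, null);
  }

  @Override
  protected DirectoryTaxonomyReader doOpenIfChanged() throws IOException {
    // Unconditionally build a fresh reader rather than letting the taxonomy
    // epoch decide whether cached state can be reused.
    final DirectoryReader reader = DirectoryReader.openIfChanged(getInternalIndexReader());
    if (reader == null) {
      return null; // nothing changed on disk
    }
    boolean success = false;
    try {
      DirectoryTaxonomyReader newReader = new AlwaysRefreshDirectoryTaxonomyReader(reader);
      success = true;
      return newReader;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(reader);
      }
    }
  }
}
```

The nulls passed for taxoWriter, ordinalCache, categoryCache and taxoArrays are what force the fresh state; a plain DirectoryTaxonomyReader would instead try to reuse its existing taxoArrays when the taxonomy epoch looks unchanged.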
[GitHub] [lucene] gautamworah96 opened a new pull request, #814: LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name
gautamworah96 opened a new pull request, #814: URL: https://github.com/apache/lucene/pull/814

Tests on Windows are failing because of an invalid `:` character in the directory name. Full stack trace and the comment that detected the issue: https://github.com/apache/lucene/pull/762#issuecomment-1100941707 Previous PR: https://github.com/apache/lucene/pull/762

# Solution
Rely on the default implementation of `createTempDir`.

# Tests
Multiple runs of the `testAlwaysRefreshDirectoryTaxonomyReader` test through the IDE on a Mac.

---

I'll also submit a fix for the backported commit once this one gets merged. Apologies again. cc: @uschindler @mikemccand
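To see why the old prefix broke Windows, it helps to look at what `Instant.now()` stringifies to. A tiny standalone illustration (not Lucene code; the class name and printed value are made up for the demo):

```java
import java.time.Instant;

public class TempDirPrefixDemo {
  public static void main(String[] args) {
    // Instant.toString() is ISO-8601, e.g. "2022-04-18T07:24:07.123456Z".
    String prefix = String.valueOf(Instant.now());
    System.out.println(prefix);
    // The ':' characters are illegal in Windows file names, so creating a
    // directory with this prefix fails there. The default createTempDir()
    // generates a unique, portable name instead.
    System.out.println("contains ':' -> " + prefix.contains(":"));
  }
}
```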
[GitHub] [lucene] gautamworah96 commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch d
gautamworah96 commented on code in PR #762: URL: https://github.com/apache/lucene/pull/762#discussion_r851816491

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:

```diff
+    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));
```

Review Comment: Created https://github.com/apache/lucene/pull/814
[GitHub] [lucene] mikemccand merged pull request #814: LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name
mikemccand merged PR #814: URL: https://github.com/apache/lucene/pull/814
[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide
[ https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523472#comment-17523472 ] ASF subversion and git services commented on LUCENE-10482: -- Commit d322be52f2407419c9fff4afee84ce4c87fe018d in lucene's branch refs/heads/main from Gautam Worah [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d322be52f24 ] LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name (#814) * Don't use Instant.now() as prefix for the temp dir name * spotless
[GitHub] [lucene] mocobeta commented on pull request #811: Add some basic tasks to help/workflow
mocobeta commented on PR #811: URL: https://github.com/apache/lucene/pull/811#issuecomment-1101035368 Thanks, and apologies for interrupting your holidays. I'll merge this soon.
[GitHub] [lucene] mocobeta merged pull request #811: Add some basic tasks to help/workflow
mocobeta merged PR #811: URL: https://github.com/apache/lucene/pull/811
[GitHub] [lucene] mocobeta commented on a diff in pull request #805: LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori
mocobeta commented on code in PR #805: URL: https://github.com/apache/lucene/pull/805#discussion_r851846785

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java:

```diff
@@ -0,0 +1,1253 @@
+/**
+ * {@link org.apache.lucene.analysis.morph.Viterbi} subclass for Japanese morphological analysis.
+ * This also performs n-best path calculation.
+ */
+final class ViterbiNBest
+    extends org.apache.lucene.analysis.morph.Viterbi {
+
+  private final EnumMap<TokenType, Dictionary<? extends JaMorphData>> dictionaryMap =
+      new EnumMap<>(TokenType.class);
+
+  private final UnknownDictionary unkDictionary;
+  private final CharacterDefinition characterDefinition;
+  private final UserDictionary userDictionary;
+
+  private final boolean discardPunctuation;
+  private final boolean searchMode;
+  private final boolean extendedMode;
+  private final boolean outputCompounds;
+
+  // Allowable cost difference for N-best output:
+  private int nBestCost = 0;
+
+  Lattice lattice = null;
+
+  private GraphvizFormatter dotOut;
+
+  ViterbiNBest(
+      TokenInfoFST fst,
+      FST.BytesReader fstReader,
+      TokenInfoDictionary dictionary,
+      TokenInfoFST userFST,
+      FST.BytesReader userFSTReader,
+      UserDictionary userDictionary,
+      ConnectionCosts costs,
+      Class positionImpl,
+      UnknownDictionary unkDictionary,
+      CharacterDefinition characterDefinition,
+      boolean discardPunctuation,
+      boolean searchMode,
+      boolean extendedMode,
+      boolean outputCompounds) {
+    super(fst, fstReader, dictionary, userFST, userFSTReader, userDictionary, costs, positionImpl);
+    this.unkDictionary = unkDictionary;
+    this.characterDefinition = characterDefinition;
+    this.userDictionary = userDictionary;
+    this.discardPunctuation = discardPunctuation;
+    this.searchMode = searchMode;
+    this.extendedMode = extendedMode;
+    this.outputCompounds = outputCompounds;
+    dictionaryMap.put(TokenType.KNOWN, dictionary);
+    dictionaryMap.put(TokenType.UNKNOWN, unkDictionary);
+    dictionaryMap.put(TokenType.USER, userDictionary);
+  }
+
+  @Override
+  protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Position posData) {
+    return !searchMode && super.shouldSkipProcessUnknownWord(unknownWordEndIndex, posData);
+  }
+
+  private static final int SEARCH_MODE_KANJI_LENGTH = 2;
+  private static final int SEARCH_MODE_OTHER_LENGTH = 7; // Must be >= SEARCH_MODE_KANJI_LENGTH
+  private static final int SEARCH_MODE_KANJI_PENALTY = 3000;
+  private static final int SEARCH_MODE_OTHER_PENALTY = 1700;
+
+  @Override
+  protected int computePenalty(int pos, int length) throws IOException {
+    if (length > SEARCH_MODE_KANJI_LENGTH) {
+      boolean allKanji = true;
+      // check if node consists of only kanji
+      final int endPos = pos + length;
+      for (int pos2 = pos; pos2 < endPos; pos2++) {
+        if (!characterDefinition.isKanji((char) buffer.get(pos2))) {
+          allKanji = false;
+          break;
+        }
+      }
+      if (allKanji) { // Process only Kanji keywords
+        return (length - SEARCH_MODE_KANJI_LENGTH) * SEARCH_MODE_KANJI_PENALTY;
+      } else if (length > SEARCH_MODE_OTHER_LENGTH)
```
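As a quick sanity check of the search-mode penalty in `computePenalty` above, here is the arithmetic spelled out (constants copied from the diff; the demo class is illustrative, not Lucene code):

```java
public class SearchModePenaltyDemo {
  public static void main(String[] args) {
    // Constants copied from the diff above.
    final int searchModeKanjiLength = 2;     // SEARCH_MODE_KANJI_LENGTH
    final int searchModeKanjiPenalty = 3000; // SEARCH_MODE_KANJI_PENALTY

    // An all-kanji candidate of length 5 pays (5 - 2) * 3000 = 9000 extra
    // path cost, while a candidate of length 2 pays nothing, so search mode
    // biases the Viterbi path toward splitting long kanji compounds.
    int length = 5;
    int penalty = (length - searchModeKanjiLength) * searchModeKanjiPenalty;
    System.out.println(penalty); // prints 9000
  }
}
```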
[GitHub] [lucene] gautamworah96 opened a new pull request, #815: Backport LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name
gautamworah96 opened a new pull request, #815: URL: https://github.com/apache/lucene/pull/815

Backport of this PR: https://github.com/apache/lucene/pull/814

Tests on Windows are failing because of an invalid `:` character in the directory name. Full stack trace and the comment that detected the issue: https://github.com/apache/lucene/pull/762#issuecomment-1100941707 Previous PR which introduced the bug in `main`: https://github.com/apache/lucene/pull/762 Sorry for all the trouble this caused!

### Test
Ran `testAlwaysRefreshDirectoryTaxonomyReader` multiple times through the IDE on a Mac.

cc @uschindler @mikemccand
[GitHub] [lucene] uschindler commented on a diff in pull request #815: Backport LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name
uschindler commented on code in PR #815: URL: https://github.com/apache/lucene/pull/815#discussion_r851923047

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:

```diff
@@ -53,15 +52,15 @@ public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
   private void testAlwaysRefreshDirectoryTaxonomyReader(
       Function<Directory, DirectoryTaxonomyReader> dtrProducer,
       Class<? extends Throwable> exceptionType)
       throws IOException {
-    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));
+    final Path taxoPath1 = createTempDir();
```

Review Comment: You could still give some static string as a prefix, but as this class only has one test, we don't need it.
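If a descriptive prefix is ever wanted again, any static, filesystem-safe string works with `createTempDir(String)`; for example (the prefix below is arbitrary, not taken from the PR):

```java
final Path taxoPath1 = createTempDir("alwaysRefreshTaxo");
```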