[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523347#comment-17523347
 ] 

Michael McCandless commented on LUCENE-10482:
-

Thanks [~gworah] -- can we resolve this now?

> Allow users to create their own DirectoryTaxonomyReaders with empty 
> taxoArrays instead of letting the taxoEpoch decide
> --
>
> Key: LUCENE-10482
> URL: https://issues.apache.org/jira/browse/LUCENE-10482
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 9.1
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> I was experimenting with the taxonomy index and {{DirectoryTaxonomyReaders}}
> in my day job, where we were trying to replace the index underneath a reader
> asynchronously and then call {{doOpenIfChanged}} on it.
> It turns out that the taxonomy index uses its own index-based counter (the
> {{taxonomyIndexEpoch}}) to determine whether the index was opened in write
> mode after the last time it was written; if not, it directly tries to
> reuse the previous {{taxoArrays}} it had created. This logic fails in a
> scenario where both the old and new index were opened just once but the index
> itself is completely different in both cases.
> In such a case, it would be good to give the user the flexibility to tell
> the DTR to recreate its {{taxoArrays}}, {{ordinalCache}} and
> {{categoryCache}} (not refreshing these arrays causes it to fail in
> various ways). Luckily, such a constructor already exists! But it is private
> today. The idea here is to allow subclasses of DTR to use this constructor.
> Curious to see what other folks think about this idea.
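The reuse pitfall described above can be sketched in isolation (plain Java; the class and method names are hypothetical and not Lucene's actual implementation — they only mirror the epoch-gated reuse logic):

```java
// Hypothetical sketch of epoch-gated cache reuse. Cached arrays are kept as
// long as the index "epoch" counter is unchanged -- so if the whole index is
// swapped for a *different* one that happens to carry the same epoch value,
// the stale arrays are silently reused.
class EpochCachedArrays {
  private long cachedEpoch = -1;
  private int[] cachedArrays = null;

  int[] arraysFor(long indexEpoch, int[] freshArrays) {
    if (indexEpoch != cachedEpoch) { // only rebuilds when the epoch moved
      cachedEpoch = indexEpoch;
      cachedArrays = freshArrays;
    }
    return cachedArrays;
  }
}
```

Two different indexes opened exactly once each both sit at the same epoch, so the second one gets the first one's arrays back: exactly the failure mode described.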



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on a diff in pull request #805: LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori

2022-04-17 Thread GitBox


mikemccand commented on code in PR #805:
URL: https://github.com/apache/lucene/pull/805#discussion_r851776382


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java:
##
@@ -0,0 +1,1253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.EnumMap;
+import java.util.HashMap;
+import java.util.List;
+import org.apache.lucene.analysis.ja.dict.CharacterDefinition;
+import org.apache.lucene.analysis.ja.dict.JaMorphData;
+import org.apache.lucene.analysis.ja.dict.TokenInfoDictionary;
+import org.apache.lucene.analysis.ja.dict.UnknownDictionary;
+import org.apache.lucene.analysis.ja.dict.UserDictionary;
+import org.apache.lucene.analysis.morph.ConnectionCosts;
+import org.apache.lucene.analysis.morph.Dictionary;
+import org.apache.lucene.analysis.morph.GraphvizFormatter;
+import org.apache.lucene.analysis.morph.TokenInfoFST;
+import org.apache.lucene.analysis.morph.TokenType;
+import org.apache.lucene.analysis.morph.Viterbi;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.fst.FST;
+
+/**
+ * {@link org.apache.lucene.analysis.morph.Viterbi} subclass for Japanese morphological analysis.
+ * This also performs n-best path calculation.
+ */
+final class ViterbiNBest
+extends org.apache.lucene.analysis.morph.Viterbi {
+
+  private final EnumMap> 
dictionaryMap =
+  new EnumMap<>(TokenType.class);
+
+  private final UnknownDictionary unkDictionary;
+  private final CharacterDefinition characterDefinition;
+  private final UserDictionary userDictionary;
+
+  private final boolean discardPunctuation;
+  private final boolean searchMode;
+  private final boolean extendedMode;
+  private final boolean outputCompounds;
+
+  // Allowable cost difference for N-best output:
+  private int nBestCost = 0;
+
+  Lattice lattice = null;
+
+  private GraphvizFormatter dotOut;
+
+  ViterbiNBest(
+  TokenInfoFST fst,
+  FST.BytesReader fstReader,
+  TokenInfoDictionary dictionary,
+  TokenInfoFST userFST,
+  FST.BytesReader userFSTReader,
+  UserDictionary userDictionary,
+  ConnectionCosts costs,
+  Class positionImpl,
+  UnknownDictionary unkDictionary,
+  CharacterDefinition characterDefinition,
+  boolean discardPunctuation,
+  boolean searchMode,
+  boolean extendedMode,
+  boolean outputCompounds) {
+    super(fst, fstReader, dictionary, userFST, userFSTReader, userDictionary, costs, positionImpl);
+this.unkDictionary = unkDictionary;
+this.characterDefinition = characterDefinition;
+this.userDictionary = userDictionary;
+this.discardPunctuation = discardPunctuation;
+this.searchMode = searchMode;
+this.extendedMode = extendedMode;
+this.outputCompounds = outputCompounds;
+dictionaryMap.put(TokenType.KNOWN, dictionary);
+dictionaryMap.put(TokenType.UNKNOWN, unkDictionary);
+dictionaryMap.put(TokenType.USER, userDictionary);
+  }
+
+  @Override
+  protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Position posData) {
+    return !searchMode && super.shouldSkipProcessUnknownWord(unknownWordEndIndex, posData);
+  }
+
+  private static final int SEARCH_MODE_KANJI_LENGTH = 2;
+  private static final int SEARCH_MODE_OTHER_LENGTH = 7; // Must be >= SEARCH_MODE_KANJI_LENGTH
+  private static final int SEARCH_MODE_KANJI_PENALTY = 3000;
+  private static final int SEARCH_MODE_OTHER_PENALTY = 1700;
+
+  @Override
+  protected int computePenalty(int pos, int length) throws IOException {
+if (length > SEARCH_MODE_KANJI_LENGTH) {
+  boolean allKanji = true;
+  // check if node consists of only kanji
+  final int endPos = pos + length;
+  for (int pos2 = pos; pos2 < endPos; pos2++) {
+if (!characterDefinition.isKanji((char) buffer.get(pos2))) {
+  allKanji = false;
+  break;
+}
+  }
+  if (allKanji) { // Process only Kanji keywords
+return (length - SEARCH_MODE_KANJI_LENGTH) * SEARCH_MODE_KANJI_PENALTY;
+  } else if (length > SEARCH_MODE_OTHER_LENGT

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-17 Thread GitBox


mayya-sharipova commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r851785737


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsWriter.java:
##
@@ -0,0 +1,328 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import static org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.RandomAccessVectorValuesProducer;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator;
+import org.apache.lucene.util.hnsw.HnswGraphBuilder;
+import org.apache.lucene.util.hnsw.NeighborArray;
+import org.apache.lucene.util.hnsw.OnHeapHnswGraph;
+import org.apache.lucene.util.packed.DirectMonotonicWriter;
+
+/**
+ * Writes vector values and knn graphs to index segments.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene92HnswVectorsWriter extends KnnVectorsWriter {
+
+  private final SegmentWriteState segmentWriteState;
+  private final IndexOutput meta, vectorData, vectorIndex;
+  private final int maxDoc;
+
+  private final int maxConn;
+  private final int beamWidth;
+  private boolean finished;
+
+  Lucene92HnswVectorsWriter(SegmentWriteState state, int maxConn, int beamWidth)
+      throws IOException {
+this.maxConn = maxConn;
+this.beamWidth = beamWidth;
+
+assert state.fieldInfos.hasVectorValues();
+segmentWriteState = state;
+
+String metaFileName =
+IndexFileNames.segmentFileName(
+            state.segmentInfo.name, state.segmentSuffix, Lucene92HnswVectorsFormat.META_EXTENSION);
+
+String vectorDataFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+Lucene92HnswVectorsFormat.VECTOR_DATA_EXTENSION);
+
+String indexDataFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+Lucene92HnswVectorsFormat.VECTOR_INDEX_EXTENSION);
+
+boolean success = false;
+try {
+  meta = state.directory.createOutput(metaFileName, state.context);
+      vectorData = state.directory.createOutput(vectorDataFileName, state.context);
+      vectorIndex = state.directory.createOutput(indexDataFileName, state.context);
+
+  CodecUtil.writeIndexHeader(
+  meta,
+  Lucene92HnswVectorsFormat.META_CODEC_NAME,
+  Lucene92HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.writeIndexHeader(
+  vectorData,
+  Lucene92HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
+  Lucene92HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.writeIndexHeader(
+  vectorIndex,
+  Lucene92HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+  Lucene92HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  maxDoc = state.segmentInfo.maxDoc();
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  @Override
+  public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+      throws IOException {
+   

[GitHub] [lucene] mayya-sharipova commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-17 Thread GitBox


mayya-sharipova commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1100923586

   @LuXugang As per @jtibshirani's [comment](https://github.com/apache/lucene/pull/792#pullrequestreview-940285463), the main factor for deciding whether we can use `DirectMonotonicWriter/Reader` is the performance measurements. We first need to see performance results to confirm that searches are not slowed down.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread GitBox


uschindler commented on PR #762:
URL: https://github.com/apache/lucene/pull/762#issuecomment-1100941707

   Hi,
   since this commit, all Jenkins Windows builds fail with this error:
   
   ```
   FAILED:  org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader
   
   Error Message:
   java.nio.file.InvalidPathException: Illegal char <:> at index 13: 2022-04-15T20:35:33.995886500Z-001
   
   Stack Trace:
   java.nio.file.InvalidPathException: Illegal char <:> at index 13: 2022-04-15T20:35:33.995886500Z-001
   	at __randomizedtesting.SeedInfo.seed([D59CA6A5F246F40:34E0EFAFAAB6C033]:0)
   	at java.base/sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
   	at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
   	at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
   	at java.base/sun.nio.fs.WindowsPath.parse(WindowsPath.java:92)
   	at java.base/sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:232)
   	at java.base/java.nio.file.Path.resolve(Path.java:515)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterPath.resolve(FilterPath.java:156)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleTemporaryFilesCleanup.createTempDir(TestRuleTemporaryFilesCleanup.java:284)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.LuceneTestCase.createTempDir(LuceneTestCase.java:3076)
   	at org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader(TestAlwaysRefreshDirectoryTaxonomyReader.java:56)
   	at org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader(TestAlwaysRefreshDirectoryTaxonomyReader.java:128)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
   	at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   	at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
   	at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
   	at randomizedtesting.runner@2.7.6/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
   	at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.

[GitHub] [lucene] uschindler commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch deci

2022-04-17 Thread GitBox


uschindler commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r851798738


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.sleep;
+import static org.apache.lucene.tests.mockfile.ExtrasFS.isExtra;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.IOUtils;
+
+// Nefarious FS will delay/stop deletion of index files and this test specifically does that
+@LuceneTestCase.SuppressFileSystems({"WindowsFS", "VirusCheckingFS"})
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should be
+   * self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * It does not check whether the private taxoArrays were actually recreated or not. We are
+   * (correctly) hiding that complexity away from the user.
+   */
+  private  void testAlwaysRefreshDirectoryTaxonomyReader(
+      Function dtrProducer, Class exceptionType)
+      throws IOException {
+final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

Review Comment:
   This breaks on Windows, because ":" is not a valid character in file names.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch deci

2022-04-17 Thread GitBox


uschindler commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r851798902


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.sleep;
+import static org.apache.lucene.tests.mockfile.ExtrasFS.isExtra;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.IOUtils;
+
+// Nefarious FS will delay/stop deletion of index files and this test specifically does that
+@LuceneTestCase.SuppressFileSystems({"WindowsFS", "VirusCheckingFS"})
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should be
+   * self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * It does not check whether the private taxoArrays were actually recreated or not. We are
+   * (correctly) hiding that complexity away from the user.
+   */
+  private  void testAlwaysRefreshDirectoryTaxonomyReader(
+      Function dtrProducer, Class exceptionType)
+      throws IOException {
+final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

Review Comment:
   Why do we need the actual time at all? Every test has its own directory, so there's no need to have the time in the file name. createTempDir makes sure the name is unique anyway!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523439#comment-17523439
 ] 

Uwe Schindler commented on LUCENE-10482:


Hey, the commit breaks all builds on Windows because the test file name contains a ":" (it looks like it comes from the timestamp in the file name). Please fix or revert!

Why do we need Instant.now() in the file name at all? createTempDir creates a unique test-dependent directory anyway. You just need to pass the test's name and you're done. Uniqueness is ensured by the test framework. Another problem with Instant.now() is non-reproducibility!
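The colon problem can be seen in isolation with a short self-contained sketch (plain Java; the helper class is hypothetical and not part of Lucene or its test framework):

```java
import java.time.Instant;

class WindowsFileNameCheck {
  // Reserved characters that cannot appear in Windows (NTFS) file names.
  private static final String RESERVED = "<>:\"/\\|?*";

  // True when no reserved character occurs in the candidate name.
  static boolean isWindowsSafe(String name) {
    return name.chars().noneMatch(c -> RESERVED.indexOf(c) >= 0);
  }
}
```

`Instant.now().toString()` is ISO-8601 output such as `2022-04-15T20:35:33.995886500Z`, which always contains ':' and therefore fails this check, while a fixed name like `taxoDir-001` passes.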

> Allow users to create their own DirectoryTaxonomyReaders with empty 
> taxoArrays instead of letting the taxoEpoch decide
> --
>
> Key: LUCENE-10482
> URL: https://issues.apache.org/jira/browse/LUCENE-10482
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 9.1
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> I was experimenting with the taxonomy index and {{DirectoryTaxonomyReaders}}
> in my day job, where we were trying to replace the index underneath a reader
> asynchronously and then call {{doOpenIfChanged}} on it.
> It turns out that the taxonomy index uses its own index-based counter (the
> {{taxonomyIndexEpoch}}) to determine whether the index was opened in write
> mode after the last time it was written; if not, it directly tries to
> reuse the previous {{taxoArrays}} it had created. This logic fails in a
> scenario where both the old and new index were opened just once but the index
> itself is completely different in both cases.
> In such a case, it would be good to give the user the flexibility to tell
> the DTR to recreate its {{taxoArrays}}, {{ordinalCache}} and
> {{categoryCache}} (not refreshing these arrays causes it to fail in
> various ways). Luckily, such a constructor already exists! But it is private
> today. The idea here is to allow subclasses of DTR to use this constructor.
> Curious to see what other folks think about this idea.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #811: Add some basic tasks to help/workflow

2022-04-17 Thread GitBox


dweiss commented on PR #811:
URL: https://github.com/apache/lucene/pull/811#issuecomment-1100945965

   Please, go ahead, @mocobeta. I'm on Easter holidays and with intermittent 
access to the internet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index

2022-04-17 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523449#comment-17523449
 ] 

Mayya Sharipova commented on LUCENE-10518:
--

[~dnhatn] Thanks for the great investigation! I agree with the idea of backporting the two-phase fields processing from Lucene 9 to Lucene 8, since Lucene 8 already doesn't allow a field to have, for example, different types of doc values in different documents.

About the second idea of introducing an option to skip the field-infos consistency check, I am not sure. We introduced this consistency check so that we can rely on it for different functionality (for example, sort optimization relies on the fact that a field is indexed with doc values and points across all documents that have this field; see also the [original email|https://lists.apache.org/thread/2jkdoj237zho8h77l8rgkn7nlklo9pct]). If we break this consistency check, we can't use these functionalities.

You also suggested that a workaround for this issue is to do a force merge on 8.x Lucene. So maybe this workaround and a patch for index #1 would be enough?

 

 

> FieldInfos consistency check can refuse to open Lucene 8 index
> --
>
> Key: LUCENE-10518
> URL: https://issues.apache.org/jira/browse/LUCENE-10518
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 8.10.1
>Reporter: Nhat Nguyen
>Priority: Major
>
> A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can 
> refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo if 
> hitting a non-aborting exception (for example [term is too 
> long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944])
>  during processing fields of a document. We don't have this problem in Lucene 
> 9 as we process fields in two phases with the [first 
> phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614]
>  processing only FieldInfos. 
> The issue can be reproduced with this snippet.
> {code:java}
> public void testWriteIndexOn8x() throws Exception {
>   FieldType KeywordField = new FieldType();
>   KeywordField.setTokenized(false);
>   KeywordField.setOmitNorms(true);
>   KeywordField.setIndexOptions(IndexOptions.DOCS);
>   KeywordField.freeze();
>   try (Directory dir = newDirectory()) {
>     IndexWriterConfig config = new IndexWriterConfig();
>     config.setCommitOnClose(false);
>     config.setMergePolicy(NoMergePolicy.INSTANCE);
>     try (IndexWriter writer = new IndexWriter(dir, config)) {
>       // first segment
>       writer.addDocument(new Document()); // an empty doc
>       Document d1 = new Document();
>       byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
>       Arrays.fill(chars, (byte) 'a');
>       d1.add(new Field("field", new BytesRef(chars), KeywordField));
>       d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
>       expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
>       writer.flush();
>       // second segment
>       Document d2 = new Document();
>       d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
>       d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
>       writer.addDocument(d2);
>       writer.flush();
>       writer.commit();
>       // Check for doc values types consistency
>       Map<String, DocValuesType> docValuesTypes = new HashMap<>();
>       try (DirectoryReader reader = DirectoryReader.open(dir)) {
>         for (LeafReaderContext leaf : reader.leaves()) {
>           for (FieldInfo fi : leaf.reader().getFieldInfos()) {
>             DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
>             if (current != null && current != fi.getDocValuesType()) {
>               fail("cannot change DocValues type from " + current + " to "
>                   + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
>             }
>           }
>         }
>       }
>     }
>   }
> }
> {code}
> I would like to propose to:
> - Backport the two-phase fields processing from Lucene 9 to Lucene 8. The patch should be small and contained.
> - Introduce an option in Lucene 9 to skip the field-infos consistency check (i.e., behave like Lucene 8 when the option is enabled).
> /cc [~mayya] and [~jpountz]
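The two-phase processing proposed above can be illustrated outside of Lucene. The sketch below is a toy model (all names are hypothetical, not Lucene APIs): field infos are either registered interleaved with value processing, which a mid-document failure leaves partial (the 8.x behavior), or registered in a first phase of their own (the 9.x behavior):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseSketch {
  // A document field: its name, its doc-values type, and its (term) value.
  record DocField(String name, String dvType, String value) {}

  // 8.x-style interleaved processing: each field's info is registered right
  // before its value is checked, so a "term too long" failure mid-document
  // leaves the already-registered infos partial.
  static Map<String, String> onePhase(List<DocField> doc, int maxLen) {
    Map<String, String> fieldInfos = new HashMap<>();
    for (DocField f : doc) {
      fieldInfos.merge(f.name(), f.dvType(), (a, b) -> a.equals("NONE") ? b : a);
      if (f.value().length() > maxLen) {
        // simulates the non-aborting exception: the remaining fields of this
        // document (e.g. the doc-values field) are never processed
        return fieldInfos;
      }
    }
    return fieldInfos;
  }

  // 9.x-style two-phase processing: phase 1 registers every field info before
  // any value is touched, so the value check in phase 2 can no longer leave
  // partial infos behind.
  static Map<String, String> twoPhase(List<DocField> doc, int maxLen) {
    Map<String, String> fieldInfos = new HashMap<>();
    for (DocField f : doc) { // phase 1: field infos only
      fieldInfos.merge(f.name(), f.dvType(), (a, b) -> a.equals("NONE") ? b : a);
    }
    for (DocField f : doc) { // phase 2: process values
      if (f.value().length() > maxLen) {
        break; // the exception now surfaces after the infos are complete
      }
    }
    return fieldInfos;
  }

  public static void main(String[] args) {
    List<DocField> doc = List.of(
        new DocField("field", "NONE", "aaaaaa"), // indexed field, term too long
        new DocField("field", "BINARY", "x"));   // doc-values field, unreached in 8.x
    System.out.println(onePhase(doc, 5)); // {field=NONE}   -- partial field info
    System.out.println(twoPhase(doc, 5)); // {field=BINARY} -- complete field info
  }
}
```

With the interleaved loop, the too-long term aborts the document before the doc-values field is seen, which is the partial-FieldInfo state the bug report describes.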



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index

2022-04-17 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523449#comment-17523449
 ] 

Mayya Sharipova edited comment on LUCENE-10518 at 4/17/22 10:17 PM:


[~dnhatn] Thanks for the great investigation! I agree with the idea of backporting the two-phase fields processing from Lucene 9 to Lucene 8, since in Lucene 8 we already don't allow a field to have, for example, different types of doc values in different documents.

About the second idea, to introduce an option to skip the field-infos consistency check: I am not sure about this. We introduced this consistency check so that we can rely on it for different functionalities (for example, sort optimization is based on the fact that a field is indexed with doc values and points across all documents that have this field; also see the [original email|https://lists.apache.org/thread/2jkdoj237zho8h77l8rgkn7nlklo9pct]). So if we break this consistency check, we can't use these functionalities.

You also suggested that a workaround for this issue is to do a force merge on 8.x Lucene. So maybe this workaround and backporting the two-phase fields processing to 8.x would be enough?
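The invariant described here, one doc-values type per field across all segments, is exactly what the reproduction snippet checks. A Lucene-independent sketch of that putIfAbsent pattern (class and type names here are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocValuesConsistency {
  // Each segment maps field name -> doc-values type; returns the first field
  // whose type differs across segments, or null if all segments agree.
  static String firstInconsistentField(List<Map<String, String>> segments) {
    Map<String, String> seen = new HashMap<>();
    for (Map<String, String> segment : segments) {
      for (Map.Entry<String, String> e : segment.entrySet()) {
        String previous = seen.putIfAbsent(e.getKey(), e.getValue());
        if (previous != null && !previous.equals(e.getValue())) {
          return e.getKey();
        }
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Segment 1 recorded "field" without doc values (the partial FieldInfo);
    // segment 2 recorded it as SORTED -- the check trips on "field".
    Map<String, String> seg1 = Map.of("field", "NONE");
    Map<String, String> seg2 = Map.of("field", "SORTED");
    System.out.println(firstInconsistentField(List.of(seg1, seg2))); // field
  }
}
```

Features like sort optimization can rely on this check precisely because it is enforced at index-opening time, which is why making it optional is problematic.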

 

 


[jira] [Comment Edited] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index

2022-04-17 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523449#comment-17523449
 ] 

Mayya Sharipova edited comment on LUCENE-10518 at 4/17/22 10:37 PM:


[~dnhatn] Thanks for the great investigation! I agree with the idea of backporting the two-phase fields processing from Lucene 9 to Lucene 8, since in Lucene 8 we already don't allow a field to have, for example, different types of doc values in different documents.

About the second idea, to introduce an option to skip the field-infos consistency check: I am not sure about this. We introduced this consistency check so that we can rely on it for different functionalities (for example, sort optimization is based on the fact that a field is indexed with doc values and points across all documents that have this field; FieldExistsQuery also relies on it; also see the [original email|https://lists.apache.org/thread/2jkdoj237zho8h77l8rgkn7nlklo9pct]). So if we break this consistency check, we can't use these functionalities anymore.

You also suggested that a workaround for this issue is to do a force merge on 8.x Lucene. So maybe this workaround and backporting the two-phase fields processing to 8.x would be enough?

 

 



[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523464#comment-17523464
 ] 

Gautam Worah commented on LUCENE-10482:
---

Got it. My apologies for introducing this error. I had switched to using Instant.now() during testing because when the single test case failed midway, it was not able to delete the temporary folder, and subsequent runs of the test would then fail because a folder with that name already existed. I'm submitting a fix right now.
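For context, the failure mode is that an ISO-8601 timestamp such as the one produced by Instant.now().toString() contains ':' characters, which are illegal in Windows file names. A minimal illustration (the safePrefix helper is hypothetical, not the fix that was actually merged, which simply drops the custom prefix):

```java
import java.time.Instant;

public class TempDirNames {
  // An ISO-8601 instant contains ':' characters, which are illegal in
  // Windows file names; replacing them yields a safe directory prefix.
  static String safePrefix(Instant instant) {
    return instant.toString().replace(':', '-');
  }

  public static void main(String[] args) {
    Instant epoch = Instant.ofEpochSecond(0);          // fixed instant for determinism
    System.out.println(epoch.toString());   // 1970-01-01T00:00:00Z -- contains ':'
    System.out.println(safePrefix(epoch));  // 1970-01-01T00-00-00Z -- safe on Windows
  }
}
```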

> Allow users to create their own DirectoryTaxonomyReaders with empty 
> taxoArrays instead of letting the taxoEpoch decide
> --
>
> Key: LUCENE-10482
> URL: https://issues.apache.org/jira/browse/LUCENE-10482
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 9.1
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> I was experimenting with the taxonomy index and {{DirectoryTaxonomyReaders}} 
> in my day job, where we were trying to replace the index underneath a reader 
> asynchronously and then call {{doOpenIfChanged}} on it.
> It turns out that the taxonomy index uses its own index-based counter (the 
> {{taxonomyIndexEpoch}}) to determine whether the index was opened in write 
> mode after the last time it was written; if not, it directly tries to 
> reuse the previous {{taxoArrays}} it had created. This logic fails in a 
> scenario where both the old and new index were opened just once but the index 
> itself is completely different in the two cases.
> In such a case, it would be good to give the user the flexibility to tell 
> the DTR to recreate its {{taxoArrays}}, {{ordinalCache}}, and 
> {{categoryCache}} (not refreshing these arrays causes it to fail in 
> various ways). Luckily, such a constructor already exists! But it is private 
> today! The idea here is to allow subclasses of DTR to use this constructor.
> Curious to see what other folks think about this idea. 
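The visibility change being proposed, letting subclasses reach a constructor that forces fresh state, can be sketched in plain Java (all class and method names here are hypothetical stand-ins, not the real DirectoryTaxonomyReader API):

```java
public class ConstructorVisibilitySketch {
  static class BaseReader {
    private final boolean freshCaches;

    BaseReader() {
      this(false);
    }

    // Was effectively private in the real class; widening it to protected is
    // the proposal -- subclasses can then request freshly built caches.
    protected BaseReader(boolean freshCaches) {
      this.freshCaches = freshCaches;
    }

    boolean hasFreshCaches() {
      return freshCaches;
    }
  }

  static class AlwaysRefreshReader extends BaseReader {
    AlwaysRefreshReader() {
      super(true); // compiles only because the constructor is now visible
    }
  }

  public static void main(String[] args) {
    System.out.println(new AlwaysRefreshReader().hasFreshCaches()); // true
  }
}
```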






[GitHub] [lucene] gautamworah96 commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch d

2022-04-17 Thread GitBox


gautamworah96 commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r851814501


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.sleep;
+import static org.apache.lucene.tests.mockfile.ExtrasFS.isExtra;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.IOUtils;
+
+// Nefarious FS will delay/stop deletion of index files and this test specifically does that
+@LuceneTestCase.SuppressFileSystems({"WindowsFS", "VirusCheckingFS"})
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should
+   * be self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * It does not check whether the private taxoArrays were actually recreated or not. We are
+   * (correctly) hiding that complexity away from the user.
+   */
+  private <T extends Throwable> void testAlwaysRefreshDirectoryTaxonomyReader(
+      Function<Directory, DirectoryTaxonomyReader> dtrProducer, Class<T> exceptionType)
+      throws IOException {
+    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

Review Comment:
   Apologies for this. I've replied on the JIRA thread as well. I had switched to using Instant.now() during testing because when the single test case failed midway, it was not able to delete the temporary folder, and subsequent runs of the test would fail because a folder with that name already existed. Non-reproducibility is also an issue, +1.
   Submitting a fix right now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gautamworah96 opened a new pull request, #814: LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name

2022-04-17 Thread GitBox


gautamworah96 opened a new pull request, #814:
URL: https://github.com/apache/lucene/pull/814

   Tests on Windows are failing because of an invalid `:` character in the directory name.
   
   Full stack trace and the comment which detected the issue: https://github.com/apache/lucene/pull/762#issuecomment-1100941707
   Previous PR: https://github.com/apache/lucene/pull/762
   
   # Solution
   
   Rely on the default implementation of `createTempDir`
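   As a sketch of what relying on the default means (using the JDK's `Files.createTempDirectory` as a stand-in for `LuceneTestCase.createTempDir`, which likewise generates a unique, portable name; the class name below is invented):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Stand-in for LuceneTestCase.createTempDir: let the platform generate the
// unique suffix instead of the caller supplying a timestamp-based name.
public class TempDirFixDemo {
    static Path createTempDir() {
        try {
            return Files.createTempDirectory("lucenetest"); // portable, unique
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path p = createTempDir();
        // The generated leaf name never contains ':', so it is valid on Windows
        System.out.println(p.getFileName());
    }
}
```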
   
   # Tests
   
   Multiple runs of the `testAlwaysRefreshDirectoryTaxonomyReader` test through the IDE on a Mac.
   
   ---
   
   I'll also submit a fix for the backported commit once this one gets merged.
   Apologies again.
   
   cc: @uschindler @mikemccand 
   





[GitHub] [lucene] gautamworah96 commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread GitBox


gautamworah96 commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r851816491


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.sleep;
+import static org.apache.lucene.tests.mockfile.ExtrasFS.isExtra;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.IOUtils;
+
+// Nefarious FS will delay/stop deletion of index files and this test specifically does that
+@LuceneTestCase.SuppressFileSystems({"WindowsFS", "VirusCheckingFS"})
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should
+   * be self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * <p>It does not check whether the private taxoArrays were actually recreated or not. We are
+   * (correctly) hiding that complexity away from the user.
+   */
+  private <T extends Throwable> void testAlwaysRefreshDirectoryTaxonomyReader(
+      Function<Directory, DirectoryTaxonomyReader> dtrProducer, Class<T> exceptionType)
+      throws IOException {
+    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));

Review Comment:
   Created https://github.com/apache/lucene/pull/814






[GitHub] [lucene] mikemccand merged pull request #814: LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name

2022-04-17 Thread GitBox


mikemccand merged PR #814:
URL: https://github.com/apache/lucene/pull/814





[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread ASF subversion and git services (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523472#comment-17523472 ]

ASF subversion and git services commented on LUCENE-10482:
--

Commit d322be52f2407419c9fff4afee84ce4c87fe018d in lucene's branch refs/heads/main from Gautam Worah
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d322be52f24 ]

LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name (#814)

* Don't use Instant.now() as prefix for the temp dir name

* spotless

> Allow users to create their own DirectoryTaxonomyReaders with empty 
> taxoArrays instead of letting the taxoEpoch decide
> --
>
> Key: LUCENE-10482
> URL: https://issues.apache.org/jira/browse/LUCENE-10482
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 9.1
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> I was experimenting with the taxonomy index and {{DirectoryTaxonomyReaders}} 
> in my day job where we were trying to replace the index underneath a reader 
> asynchronously and then call the {{doOpenIfChanged}} call on it.
> It turns out that the taxonomy index uses its own index based counter (the 
> {{{}taxonomyIndexEpoch{}}}) to determine if the index was opened in write 
> mode after the last time it was written and if not, it directly tries to 
> reuse the previous {{taxoArrays}} it had created. This logic fails in a 
> scenario where both the old and new index were opened just once but the index 
> itself is completely different in both the cases.
> In such a case, it would be good to give the user the flexibility to inform 
> the DTR to recreate its {{{}taxoArrays{}}}, {{ordinalCache}} and 
> {{{}categoryCache{}}} (not refreshing these arrays causes it to fail in 
> various ways). Luckily, such a constructor already exists! But it is private 
> today! The idea here is to allow subclasses of DTR to use this constructor.
> Curious to see what other folks think about this idea. 
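The epoch check the quoted description refers to can be modeled in a few lines. This is a toy model with invented names, not DTR's actual code: it only illustrates why equal epochs wrongly suppress cache recreation when the index is swapped wholesale, and how an always-refresh subclass would behave.

```java
// Toy model (hypothetical names) of the taxonomyIndexEpoch reuse logic:
// the stock reader recreates taxoArrays/ordinalCache/categoryCache only when
// the epoch advanced, which is wrong if a different index was swapped in but
// happened to be opened in write mode the same number of times.
public class EpochReuseDemo {
    static boolean shouldRecreateCaches(long oldEpoch, long newEpoch, boolean alwaysRefresh) {
        // alwaysRefresh models the proposed subclass that ignores the epoch
        return alwaysRefresh || oldEpoch != newEpoch;
    }

    public static void main(String[] args) {
        // Both indexes opened once in write mode -> equal epochs
        System.out.println(shouldRecreateCaches(1L, 1L, false)); // false: stale caches reused (bug scenario)
        System.out.println(shouldRecreateCaches(1L, 1L, true));  // true: caches rebuilt (proposed behavior)
    }
}
```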



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[GitHub] [lucene] mocobeta commented on pull request #811: Add some basic tasks to help/workflow

2022-04-17 Thread GitBox


mocobeta commented on PR #811:
URL: https://github.com/apache/lucene/pull/811#issuecomment-1101035368

   Thanks, and apologies for interrupting your holidays. I'll merge this soon.





[GitHub] [lucene] mocobeta merged pull request #811: Add some basic tasks to help/workflow

2022-04-17 Thread GitBox


mocobeta merged PR #811:
URL: https://github.com/apache/lucene/pull/811





[GitHub] [lucene] mocobeta commented on a diff in pull request #805: LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori

2022-04-17 Thread GitBox


mocobeta commented on code in PR #805:
URL: https://github.com/apache/lucene/pull/805#discussion_r851846785


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java:
##
@@ -0,0 +1,1253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.EnumMap;
+import java.util.HashMap;
+import java.util.List;
+import org.apache.lucene.analysis.ja.dict.CharacterDefinition;
+import org.apache.lucene.analysis.ja.dict.JaMorphData;
+import org.apache.lucene.analysis.ja.dict.TokenInfoDictionary;
+import org.apache.lucene.analysis.ja.dict.UnknownDictionary;
+import org.apache.lucene.analysis.ja.dict.UserDictionary;
+import org.apache.lucene.analysis.morph.ConnectionCosts;
+import org.apache.lucene.analysis.morph.Dictionary;
+import org.apache.lucene.analysis.morph.GraphvizFormatter;
+import org.apache.lucene.analysis.morph.TokenInfoFST;
+import org.apache.lucene.analysis.morph.TokenType;
+import org.apache.lucene.analysis.morph.Viterbi;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.fst.FST;
+
+/**
+ * {@link org.apache.lucene.analysis.morph.Viterbi} subclass for Japanese morphological analysis.
+ * This also performs n-best path calculation.
+ */
+final class ViterbiNBest
+extends org.apache.lucene.analysis.morph.Viterbi {
+
+  private final EnumMap<TokenType, Dictionary<? extends JaMorphData>> dictionaryMap =
+      new EnumMap<>(TokenType.class);
+
+  private final UnknownDictionary unkDictionary;
+  private final CharacterDefinition characterDefinition;
+  private final UserDictionary userDictionary;
+
+  private final boolean discardPunctuation;
+  private final boolean searchMode;
+  private final boolean extendedMode;
+  private final boolean outputCompounds;
+
+  // Allowable cost difference for N-best output:
+  private int nBestCost = 0;
+
+  Lattice lattice = null;
+
+  private GraphvizFormatter dotOut;
+
+  ViterbiNBest(
+  TokenInfoFST fst,
+  FST.BytesReader fstReader,
+  TokenInfoDictionary dictionary,
+  TokenInfoFST userFST,
+  FST.BytesReader userFSTReader,
+  UserDictionary userDictionary,
+  ConnectionCosts costs,
+  Class positionImpl,
+  UnknownDictionary unkDictionary,
+  CharacterDefinition characterDefinition,
+  boolean discardPunctuation,
+  boolean searchMode,
+  boolean extendedMode,
+  boolean outputCompounds) {
+super(fst, fstReader, dictionary, userFST, userFSTReader, userDictionary, 
costs, positionImpl);
+this.unkDictionary = unkDictionary;
+this.characterDefinition = characterDefinition;
+this.userDictionary = userDictionary;
+this.discardPunctuation = discardPunctuation;
+this.searchMode = searchMode;
+this.extendedMode = extendedMode;
+this.outputCompounds = outputCompounds;
+dictionaryMap.put(TokenType.KNOWN, dictionary);
+dictionaryMap.put(TokenType.UNKNOWN, unkDictionary);
+dictionaryMap.put(TokenType.USER, userDictionary);
+  }
+
+  @Override
+  protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Position posData) {
+    return !searchMode && super.shouldSkipProcessUnknownWord(unknownWordEndIndex, posData);
+  }
+
+  private static final int SEARCH_MODE_KANJI_LENGTH = 2;
+  private static final int SEARCH_MODE_OTHER_LENGTH = 7; // Must be >= SEARCH_MODE_KANJI_LENGTH
+  private static final int SEARCH_MODE_KANJI_PENALTY = 3000;
+  private static final int SEARCH_MODE_OTHER_PENALTY = 1700;
+
+  @Override
+  protected int computePenalty(int pos, int length) throws IOException {
+    if (length > SEARCH_MODE_KANJI_LENGTH) {
+      boolean allKanji = true;
+      // check if node consists of only kanji
+      final int endPos = pos + length;
+      for (int pos2 = pos; pos2 < endPos; pos2++) {
+        if (!characterDefinition.isKanji((char) buffer.get(pos2))) {
+          allKanji = false;
+          break;
+        }
+      }
+      if (allKanji) { // Process only Kanji keywords
+        return (length - SEARCH_MODE_KANJI_LENGTH) * SEARCH_MODE_KANJI_PENALTY;
+      } else if (length > SEARCH_MODE_OTHER_LENGTH)
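The diff above is truncated by the archive, but the kanji-penalty arithmetic it shows is self-contained enough to recreate. The sketch below copies the constants from the excerpt; the surrounding Viterbi plumbing is omitted and the class name is invented:

```java
// Standalone recreation of the search-mode kanji penalty from the diff above:
// for an all-kanji node longer than SEARCH_MODE_KANJI_LENGTH, the penalty
// grows linearly with the excess length.
public class SearchModePenaltyDemo {
    static final int SEARCH_MODE_KANJI_LENGTH = 2;
    static final int SEARCH_MODE_KANJI_PENALTY = 3000;

    static int kanjiPenalty(int length) {
        return length > SEARCH_MODE_KANJI_LENGTH
            ? (length - SEARCH_MODE_KANJI_LENGTH) * SEARCH_MODE_KANJI_PENALTY
            : 0;
    }

    public static void main(String[] args) {
        System.out.println(kanjiPenalty(2)); // 0
        System.out.println(kanjiPenalty(5)); // 9000
    }
}
```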

[GitHub] [lucene] gautamworah96 opened a new pull request, #815: Backport LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name

2022-04-17 Thread GitBox


gautamworah96 opened a new pull request, #815:
URL: https://github.com/apache/lucene/pull/815

   Backport of this PR: https://github.com/apache/lucene/pull/814
   
   Tests on Windows are failing because of an invalid `:` character in the directory name.
   
   Full stack trace and the comment which detected the issue: https://github.com/apache/lucene/pull/762#issuecomment-1100941707
   Previous PR which introduced the bug in `main`: https://github.com/apache/lucene/pull/762
   Sorry for all the trouble this caused!
   
   ### Test
   
   Ran `testAlwaysRefreshDirectoryTaxonomyReader` multiple times through the IDE on a Mac
   
   cc @uschindler @mikemccand 
   





[GitHub] [lucene] uschindler commented on a diff in pull request #815: Backport LUCENE-10482 Bug Fix: Don't use Instant.now() as prefix for the temp dir name

2022-04-17 Thread GitBox


uschindler commented on code in PR #815:
URL: https://github.com/apache/lucene/pull/815#discussion_r851923047


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -53,15 +52,15 @@ public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
   private <T extends Throwable> void testAlwaysRefreshDirectoryTaxonomyReader(
       Function<Directory, DirectoryTaxonomyReader> dtrProducer, Class<T> exceptionType)
       throws IOException {
-    final Path taxoPath1 = createTempDir(String.valueOf(Instant.now()));
+    final Path taxoPath1 = createTempDir();

Review Comment:
   You could still pass some static string as a prefix, but as this class only has one test, we don't need it.


