[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-29 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r697971217



##
File path: lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java
##
@@ -321,6 +326,11 @@ public static void syncConcurrentMerges(MergeScheduler ms) {
   checker.setDoSlowChecks(doSlowChecks);
   checker.setFailFast(failFast);
   checker.setInfoStream(new PrintStream(output, false, IOUtils.UTF_8), false);
+  if (concurrent) {
+    checker.setThreadCount(RandomizedTest.randomIntBetween(1, 5));
+  } else {
+    checker.setThreadCount(0);

Review comment:
   Ok, after some more thought I do think that makes sense, particularly the idea that the choice of `main` vs. `executor` is really an implementation detail that users shouldn't need to worry about, so having `-threadCount 1` represent single-threaded execution would be the most intuitive approach here. I've pushed a new commit to update this accordingly.
   
   Also, in the latest commit I've made it the default that, when users don't specify a `-threadCount` value via the command line, concurrent index checking is used on machines with more than one core, with the number of threads capped at 4 via `Math.min(Runtime.getRuntime().availableProcessors(), 4)`. I think this default behavior was discussed in multiple places in this PR, but I'd like to double-confirm: is this the preferred default (versus sequential index checking)?
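   
   For reference, a minimal sketch of that default (hypothetical helper name; the PR wires the equivalent logic into `CheckIndex` itself):
   
   ```java
   // Sketch only: use concurrent checking by default on multi-core machines,
   // capping the thread count at 4; fall back to single-threaded otherwise.
   static int defaultThreadCount() {
     int cores = Runtime.getRuntime().availableProcessors();
     return cores > 1 ? Math.min(cores, 4) : 1;
   }
   ```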

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -450,6 +480,14 @@ public void setChecksumsOnly(boolean v) {
 
   private boolean checksumsOnly;
 
+  /** Set threadCount used for parallelizing index integrity checking. */
+  public void setThreadCount(int tc) {
+    threadCount = tc;

Review comment:
   Done.






[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-29 Thread GitBox


wuda0112 commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-907748573


   > So awesome that `SimpleText` is finally getting a skipping implementation! 
Thank you @wuda0112!
   > 
   > Have you tried stressing out the new code by running all Lucene unit-tests 
with `SimpleText`? Something like `./gradlew -Dtests.codec=SimpleText`.
   
   @mikemccand  Thanks for your reply! I have executed `./gradlew check`, `./gradlew test`, `./gradlew -p lucene/core/ test`, and `./gradlew -p lucene/codecs/ test`, and all passed!
   But when I execute `./gradlew test -Dtests.codec=SimpleText`, one test FAILED, with this exception:
   ```
   org.apache.lucene.demo.TestDemo > testKnnVectorSearch FAILED
   java.lang.UnsupportedOperationException
   at __randomizedtesting.SeedInfo.seed([2DAF064A760C3554:7F0D5D18D85E7028]:0)
   at org.apache.lucene.codecs.simpletext.SimpleTextKnnVectorsReader.search(SimpleTextKnnVectorsReader.java:143)
   ```
   When I look into SimpleTextKnnVectorsReader#search, it just does `throw new UnsupportedOperationException`.





[jira] [Commented] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache

2021-08-29 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406403#comment-17406403
 ] 

Michael McCandless commented on LUCENE-10068:
-

Yeah, that's a good point [~rcmuir] – maybe we should just remove this cache!  
Especially with the upcoming {{getBulkPath}}, it's likely this cache is no longer 
helpful.  So +1 to first benchmarking with and without the cache, and then 
maybe nuking it entirely.

> Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
> -
>
> Key: LUCENE-10068
> URL: https://issues.apache.org/jira/browse/LUCENE-10068
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
>
> While working on an unrelated getBulkPath API 
> [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came 
> across a nice optimization that could be made to the taxonomy cache.
> The taxonomy cache today caches frequently used ordinals and their 
> corresponding FacetLabels. It uses the existing LRUHashMap class (backed by a 
> LinkedHashMap) for its implementation.
> This implementation performs suboptimally when a large number of threads 
> access it, and it consumes a large amount of RAM.
> [~mikemccand] suggested the idea of a two-map-backed HPPC int->FacetLabel 
> cache. The basic idea behind the cache:
>  # We use two hashmaps: primary and secondary.
>  # On a cache miss in the primary and a cache hit in the secondary, 
> we add the key to the primary map as well.
>  # On a cache miss in both maps, we add the entry to the primary map.
>  # When the primary cache reaches a large number of elements (should we make 
> this check on each insert?), say larger than the existing 
> {{DEFAULT_CACHE_VALUE}}=4000, we dump the secondary map 
> and copy all the values of the primary map into it.
> The idea was originally explained in 
> [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] 
> comment.
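
A minimal sketch of the double-barrel scheme described above (hypothetical class and method names; plain HashMaps stand in for HPPC's int->Object maps):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch only, not the actual patch: two maps approximate an LRU cache by
// generational promotion, following steps 1-4 above.
class DoubleBarrelCache<V> {
  private final int maxSize; // e.g. the existing DEFAULT_CACHE_VALUE = 4000
  private Map<Integer, V> primary = new HashMap<>();
  private Map<Integer, V> secondary = new HashMap<>();

  DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
  }

  synchronized V get(int key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        put(key, value); // hit in secondary: promote the key into primary
      }
    }
    return value;
  }

  synchronized void put(int key, V value) {
    primary.put(key, value);
    if (primary.size() > maxSize) {
      // primary is full: dump the old secondary, and the current primary
      // becomes the new secondary
      secondary = primary;
      primary = new HashMap<>();
    }
  }
}
{code}

An actual implementation would use HPPC's primitive-keyed maps to avoid boxing the int ordinals, which is where the RAM savings would come from.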






[GitHub] [lucene] mikemccand commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-29 Thread GitBox


mikemccand commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-907786967


   > > So awesome that `SimpleText` is finally getting a skipping 
implementation! Thank you @wuda0112!
   > > Have you tried stressing out the new code by running all Lucene 
unit-tests with `SimpleText`? Something like `./gradlew 
-Dtests.codec=SimpleText`.
   > 
   > @mikemccand Thanks for your reply! I have executed `./gradlew check`, `./gradlew test`, `./gradlew -p lucene/core/ test`, and `./gradlew -p lucene/codecs/ test`, and all passed!
   > But when I execute `./gradlew test -Dtests.codec=SimpleText`, one test FAILED, with this exception:
   > 
   > ```
   > org.apache.lucene.demo.TestDemo > testKnnVectorSearch FAILED
   > java.lang.UnsupportedOperationException
   > at __randomizedtesting.SeedInfo.seed([2DAF064A760C3554:7F0D5D18D85E7028]:0)
   > at org.apache.lucene.codecs.simpletext.SimpleTextKnnVectorsReader.search(SimpleTextKnnVectorsReader.java:143)
   > ```
   > 
   > When I look into SimpleTextKnnVectorsReader#search, it just does `throw new UnsupportedOperationException`.
   
   Ahh, it looks like `SimpleText` is missing its KNN implementation, but it looks like it's coming soon!  https://issues.apache.org/jira/browse/LUCENE-10063.  In the meantime, that failing test should add an `assume` that the current codec is not `SimpleText`.
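   
   A sketch of such a guard (assuming the test extends `LuceneTestCase`, which provides `assumeFalse`; the message string is illustrative):
   
   ```java
   // Hypothetical guard for the top of TestDemo.testKnnVectorSearch: skip the
   // test when the randomized codec is SimpleText, whose KnnVectorsReader has
   // no search() implementation yet (LUCENE-10063).
   assumeFalse(
       "SimpleText does not support KNN vector search yet",
       org.apache.lucene.codecs.Codec.getDefault().getName().equals("SimpleText"));
   ```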





[GitHub] [lucene] mikemccand commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-29 Thread GitBox


mikemccand commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r698012354



##
File path: lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java
##
@@ -321,6 +326,11 @@ public static void syncConcurrentMerges(MergeScheduler ms) {
   checker.setDoSlowChecks(doSlowChecks);
   checker.setFailFast(failFast);
   checker.setInfoStream(new PrintStream(output, false, IOUtils.UTF_8), false);
+  if (concurrent) {
+    checker.setThreadCount(RandomizedTest.randomIntBetween(1, 5));
+  } else {
+    checker.setThreadCount(0);

Review comment:
   Great, thanks @zacharymorn -- that sounds like a good approach.







[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-29 Thread GitBox


wuda0112 commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-907810690


   > > > So awesome that `SimpleText` is finally getting a skipping implementation! Thank you @wuda0112!
   > > > Have you tried stressing out the new code by running all Lucene unit-tests with `SimpleText`? Something like `./gradlew -Dtests.codec=SimpleText`.
   > > 
   > > @mikemccand Thanks for your reply! I have executed `./gradlew check`, `./gradlew test`, `./gradlew -p lucene/core/ test`, and `./gradlew -p lucene/codecs/ test`, and all passed!
   > > But when I execute `./gradlew test -Dtests.codec=SimpleText`, one test FAILED, with this exception:
   > > ```
   > > org.apache.lucene.demo.TestDemo > testKnnVectorSearch FAILED
   > > java.lang.UnsupportedOperationException
   > > at __randomizedtesting.SeedInfo.seed([2DAF064A760C3554:7F0D5D18D85E7028]:0)
   > > at org.apache.lucene.codecs.simpletext.SimpleTextKnnVectorsReader.search(SimpleTextKnnVectorsReader.java:143)
   > > ```
   > > When I look into SimpleTextKnnVectorsReader#search, it just does `throw new UnsupportedOperationException`.
   > 
   > Ahh, it looks like `SimpleText` is missing its KNN implementation, but it looks like it's coming soon! https://issues.apache.org/jira/browse/LUCENE-10063. In the meantime, that failing test should add an `assume` that the current codec is not `SimpleText`.
   
   Ok, got it, thank you!





[GitHub] [lucene] rmuir commented on pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…

2021-08-29 Thread GitBox


rmuir commented on pull request #271:
URL: https://github.com/apache/lucene/pull/271#issuecomment-907829353


   Could we look at storing this stuff as docvalues instead of as payloads that 
are read into heap memory? It is just integers or lists of integers, right?
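   
   A minimal sketch of the indexing side of that idea (the field name and `parentOrdinals` parameter are made up for illustration; `BinaryDocValuesField` and `ByteBuffersDataOutput` are existing Lucene classes):
   
   ```java
   import java.io.IOException;
   import org.apache.lucene.document.BinaryDocValuesField;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.store.ByteBuffersDataOutput;
   import org.apache.lucene.util.BytesRef;
   
   // Sketch only: encode the per-category integers into a BinaryDocValues
   // field instead of token payloads that are read into heap memory.
   static void addOrdinalsField(Document doc, int[] parentOrdinals) throws IOException {
     ByteBuffersDataOutput out = new ByteBuffersDataOutput();
     for (int ordinal : parentOrdinals) {
       out.writeVInt(ordinal); // "just integers or lists of integers"
     }
     doc.add(new BinaryDocValuesField("$parent_ords", new BytesRef(out.toArrayCopy())));
   }
   ```
   
   At search time the values could then be read on demand via `LeafReader#getBinaryDocValues` rather than being decoded into heap up front.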





[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-29 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698055359



##
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
##
@@ -89,6 +89,28 @@ private void createNewTaxonomyIndex(String dirName) throws IOException {
     dir.close();
   }
 
+  // Opens up a pre-existing index and tries to run getBulkPath on it
+  public void testGetBulkPathOnOlderCodec() throws Exception {
+    Path indexDir = createTempDir(oldTaxonomyIndexName);
+    TestUtil.unzip(getDataInputStream(oldTaxonomyIndexName + ".zip"), indexDir);
+    Directory dir = newFSDirectory(indexDir);
+
+    DirectoryTaxonomyWriter writer = new DirectoryTaxonomyWriter(dir);
+    DirectoryTaxonomyReader reader = new DirectoryTaxonomyReader(writer);
+
+    FacetLabel[] facetLabels = {
+      new FacetLabel("a"), new FacetLabel("b"),
+    };
+    int[] ordinals =
+        new int[] {reader.getOrdinal(facetLabels[0]), reader.getOrdinal(facetLabels[1])};
+    // The zip file already contains a category "a" and a category "b" stored with the older

Review comment:
   We have already added this condition in the `createNewTaxonomyIndex()` test (i.e. the back-compat test we added for the BDV switch). See [here](https://github.com/apache/lucene/blob/56eb76dbaf7b2e9a4fbf7c1a25af2ba81ab1bc92/lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java#L74)







[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-29 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698056689



##
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java
##
@@ -366,9 +366,9 @@ public void testReaderBasic() throws Exception {
       }
     }
     //  (also test invalid ordinals:)
-    assertNull(tr.getPath(-1));
-    assertNull(tr.getPath(tr.getSize()));
-    assertNull(tr.getPath(TaxonomyReader.INVALID_ORDINAL));
+    expectThrows(IllegalArgumentException.class, () -> tr.getPath(-1));

Review comment:
   Sure. Added it







[jira] [Created] (LUCENE-10077) Closing the DirTaxonomyReader while another thread accesses the cache can throw NPE

2021-08-29 Thread Gautam Worah (Jira)
Gautam Worah created LUCENE-10077:
-

 Summary: Closing the DirTaxonomyReader while another thread accesses 
the cache can throw NPE
 Key: LUCENE-10077
 URL: https://issues.apache.org/jira/browse/LUCENE-10077
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Affects Versions: main (9.0)
Reporter: Gautam Worah


When we close a {{DirectoryTaxonomyReader}} in {{doClose}}, we set the 
{{categoryCache}} to null. But if a thread comes in right after this {{doClose}} 
call, it will still try to acquire the lock and {{synchronize}} on the cache, 
which results in an NPE.

This works well today, because we operate on the assumption that the user will 
always call {{doClose}} after all threads have completed. 
 One suggestion by [~mikemccand] in this 
[PR|https://github.com/apache/lucene/pull/179#discussion_r697880516] was to 
make {{categoryCache}} final and throw an {{AlreadyClosedException}}.
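
A rough sketch of that suggestion (hypothetical method name; {{ensureOpen()}} already throws {{AlreadyClosedException}} on a closed reader):

{code:java}
// Sketch only: keep categoryCache final so doClose() never nulls it, and
// reject access after close via ensureOpen(), which throws
// AlreadyClosedException instead of an NPE.
private final LRUHashMap<Integer, FacetLabel> categoryCache =
    new LRUHashMap<>(DEFAULT_CACHE_VALUE);

private FacetLabel getFromCache(int ordinal) {
  ensureOpen(); // throws AlreadyClosedException once the reader is closed
  synchronized (categoryCache) {
    return categoryCache.get(ordinal);
  }
}
{code}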






[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-29 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698061294



##
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+    FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+    // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      for (int i = 0; i < ordinals.length; i++) {
+        facetLabels[i] = categoryCache.get(ordinals[i]);
+      }
+    }
+    return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+   *     out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+    for (int ordinal : ordinals) {
+      if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+        throw new IllegalArgumentException(
+            "ordinal "
+                + ordinal
+                + " is out of the range of the indexReader "
+                + indexReader.toString()
+                + ". The maximum possible ordinal number is "
+                + (indexReader.maxDoc() - 1));
+      }
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+    checkOrdinalBounds(ordinals);
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    getPathFromCache(ordinals);

Review comment:
   Ahh. This can indeed happen. Opened 
https://issues.apache.org/jira/browse/LUCENE-10077







[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-29 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698080163



##
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+    FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+    // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      for (int i = 0; i < ordinals.length; i++) {
+        facetLabels[i] = categoryCache.get(ordinals[i]);
+      }
+    }
+    return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+   *     out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+    for (int ordinal : ordinals) {
+      if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+        throw new IllegalArgumentException(
+            "ordinal "
+                + ordinal
+                + " is out of the range of the indexReader "
+                + indexReader.toString()
+                + ". The maximum possible ordinal number is "
+                + (indexReader.maxDoc() - 1));
+      }
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+    checkOrdinalBounds(ordinals);
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    getPathFromCache(ordinals);
+
+    /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+    List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+        Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+        (which is the maxDoc of the specific leaf)
+         */
+        if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          /*
+          If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+           */
+          if (values == null) {

Review comment:
   Sure. I'll create it after this PR is merged. Otherwise, it just feels 
weird to create an issue for a fix in a piece of code that has not been merged. 





[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-29 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-907952441


   > Thanks @zacharymorn -- the changes look awesome! I'm looking forward to 
faster `CheckIndex`!
   
   Thanks @mikemccand again for the review and approval, and I look forward to 
the speedup as well! I'll merge in a few days, just in case anyone else has 
further comments.





[jira] [Commented] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace

2021-08-29 Thread Anh Dung Bui (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406501#comment-17406501
 ] 

Anh Dung Bui commented on LUCENE-10059:
---

Thanks [~mikemccand], we already backported this to Lucene 8.x. I'll resolve this 
and create a follow-up JIRA.

> Assertion error in JapaneseTokenizer backtrace
> --
>
> Key: LUCENE-10059
> URL: https://issues.apache.org/jira/browse/LUCENE-10059
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.8
>Reporter: Anh Dung Bui
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is a rare case which causes an AssertionError in the backtrace step of 
> JapaneseTokenizer that we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by 
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
>  where the regular backtrace is not called, a [forced 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
>  will be applied. If the partial best path at this point happens to end at 
> the last pos, then since there is always a [final 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
>  applied at the end, the final backtrace will try to backtrace from and to 
> the same position, causing an AssertionError in RollingCharBuffer.get() when 
> it tries to generate an empty buffer.
> We are fixing it by returning prematurely in the backtrace() method when the 
> from and to pos are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially a no-op when this condition happens, so 
> even when _-ea_ is not enabled, it still outputs the correct tokens.
> We will open a PR for this issue.






[jira] [Updated] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace

2021-08-29 Thread Anh Dung Bui (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anh Dung Bui updated LUCENE-10059:
--
Fix Version/s: main (9.0)
   8.x

> Assertion error in JapaneseTokenizer backtrace
> --
>
> Key: LUCENE-10059
> URL: https://issues.apache.org/jira/browse/LUCENE-10059
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.8
>Reporter: Anh Dung Bui
>Priority: Major
> Fix For: 8.x, main (9.0)
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is a rare case which causes an AssertionError in the backtrace step of 
> JapaneseTokenizer that we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by 
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
>  where the regular backtrace is not called, a [forced 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
>  will be applied. If the partial best path at this point happens to end at 
> the last pos, then since there is always a [final 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
>  applied at the end, the final backtrace will try to backtrace from and to 
> the same position, causing an AssertionError in RollingCharBuffer.get() when 
> it tries to generate an empty buffer.
> We are fixing it by returning prematurely in the backtrace() method when the 
> from and to pos are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially a no-op when this condition happens, so 
> even when _-ea_ is not enabled, it still outputs the correct tokens.
> We will open a PR for this issue.






[jira] [Resolved] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace

2021-08-29 Thread Anh Dung Bui (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anh Dung Bui resolved LUCENE-10059.
---
Resolution: Fixed

> Assertion error in JapaneseTokenizer backtrace
> --
>
> Key: LUCENE-10059
> URL: https://issues.apache.org/jira/browse/LUCENE-10059
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.8
>Reporter: Anh Dung Bui
>Priority: Major
> Fix For: 8.x, main (9.0)
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is a rare case which causes an AssertionError in the backtrace step of 
> JapaneseTokenizer that we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by 
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
>  where the regular backtrace is not called, a [forced 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
>  will be applied. If the partial best path at this point happens to end at 
> the last pos, then since there is always a [final 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
>  applied at the end, the final backtrace will try to backtrace from and to 
> the same position, causing an AssertionError in RollingCharBuffer.get() when 
> it tries to generate an empty buffer.
> We are fixing it by returning prematurely in the backtrace() method when the 
> from and to pos are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially a no-op when this condition happens, so 
> even when _-ea_ is not enabled, it still outputs the correct tokens.
> We will open a PR for this issue.






[jira] [Commented] (LUCENE-8723) Bad interaction between WordDelimiterGraphFilter, StopFilter and FlattenGraphFilter

2021-08-29 Thread Geoffrey Lawson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406516#comment-17406516
 ] 

Geoffrey Lawson commented on LUCENE-8723:
-

I believe this issue has been fixed in LUCENE-9963. The test 
{{testAltPathLastStepHoleWithoutEndToken}} in {{TestFlattenGraphFilter}} is 
based on this problem. There were some issues with {{FlattenGraphFilter}} not 
correctly handling deleted tokens. Both test cases in this issue appear to be 
working now.



> Bad interaction between WordDelimiterGraphFilter, StopFilter and 
> FlattenGraphFilter
> ---
>
> Key: LUCENE-8723
> URL: https://issues.apache.org/jira/browse/LUCENE-8723
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.7.1, 8.0, 8.3
>Reporter: Nicolás Lichtmaier
>Priority: Major
>
> I was debugging an issue (missing tokens after analysis) and when I enabled 
> Java assertions I uncovered a bug when using WordDelimiterGraphFilter + 
> StopFilter + FlattenGraphFilter.
> I could reproduce the issue in a small piece of code. This code gives an 
> assertion failure when assertions are enabled (-ea java option):
> {code:java}
>     Builder builder = CustomAnalyzer.builder();
>     builder.withTokenizer(StandardTokenizerFactory.class);
>     builder.addTokenFilter(WordDelimiterGraphFilterFactory.class, 
> "preserveOriginal", "1");
>     builder.addTokenFilter(StopFilterFactory.class);
>     builder.addTokenFilter(FlattenGraphFilterFactory.class);
>     Analyzer analyzer = builder.build();
>      
>     TokenStream ts = analyzer.tokenStream("*", new StringReader("x7in"));
>     ts.reset();
>     while(ts.incrementToken())
>         ;
> {code}
> This gives:
> {code}
> Exception in thread "main" java.lang.AssertionError: 2
>      at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
>      at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258)
>      at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32)
> {code}
> Maybe removing stop words after WordDelimiterGraphFilter is wrong, I don't 
> know. However, it is the only way to process stop words generated by that 
> filter. In any case, it should not eat tokens or trigger assertion failures. 


