[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments
zacharymorn commented on a change in pull request #128: URL: https://github.com/apache/lucene/pull/128#discussion_r697971217 ## File path: lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java ## @@ -321,6 +326,11 @@ public static void syncConcurrentMerges(MergeScheduler ms) { checker.setDoSlowChecks(doSlowChecks); checker.setFailFast(failFast); checker.setInfoStream(new PrintStream(output, false, IOUtils.UTF_8), false); + if (concurrent) { +checker.setThreadCount(RandomizedTest.randomIntBetween(1, 5)); + } else { +checker.setThreadCount(0); Review comment: Ok, after some more thought I do think that makes sense, particularly the idea that this (`main` vs. `executor`) is really an implementation detail that users shouldn't need to worry about, so `-threadCount 1` to represent single-threaded execution would be the most intuitive approach here. I've pushed a new commit to update this accordingly. Also, in the latest commit I've made it the default that, when users don't specify a `-threadCount` value via the command line, concurrent index checking is used on machines with more than one core, with the number of threads capped at 4 via `Math.min(Runtime.getRuntime().availableProcessors(), 4)`. I think this default behavior was discussed in multiple places in this PR, but I would like to confirm that this is the preferred default (versus sequential index checking)? ## File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java ## @@ -450,6 +480,14 @@ public void setChecksumsOnly(boolean v) { private boolean checksumsOnly; + /** Set threadCount used for parallelizing index integrity checking. */ + public void setThreadCount(int tc) { +threadCount = tc; Review comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
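To make the default described above concrete, here is a minimal sketch of how a caller could apply it through the public API (the standalone class and the `defaultThreadCount` helper are hypothetical; `setThreadCount` is the setter added by this PR):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class CheckIndexWithDefaults {
  // Mirrors the default discussed above: use all cores, capped at 4.
  // A value of 1 means single-threaded checking.
  static int defaultThreadCount() {
    return Math.min(Runtime.getRuntime().availableProcessors(), 4);
  }

  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
        CheckIndex checker = new CheckIndex(dir)) {
      checker.setThreadCount(defaultThreadCount());
      CheckIndex.Status status = checker.checkIndex();
      System.out.println("index clean? " + status.clean);
    }
  }
}
```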
[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
wuda0112 commented on pull request #224: URL: https://github.com/apache/lucene/pull/224#issuecomment-907748573 > So awesome that `SimpleText` is finally getting a skipping implementation! Thank you @wuda0112! > > Have you tried stressing out the new code by running all Lucene unit-tests with `SimpleText`? Something like `./gradlew -Dtests.codec=SimpleText`. @mikemccand Thanks for your reply! I have executed `./gradlew check`, `./gradlew test`, `./gradlew -p lucene/core/ test` and `./gradlew -p lucene/codecs/ test`, and all of them passed! But when I execute `./gradlew test -Dtests.codec=SimpleText`, there is one failure; the exception is: ``` org.apache.lucene.demo.TestDemo > testKnnVectorSearch FAILED java.lang.UnsupportedOperationException at __randomizedtesting.SeedInfo.seed([2DAF064A760C3554:7F0D5D18D85E7028]:0) at org.apache.lucene.codecs.simpletext.SimpleTextKnnVectorsReader.search(SimpleTextKnnVectorsReader.java:143) ``` When I look into SimpleTextKnnVectorsReader#search, it does `throw new UnsupportedOperationException`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
[ https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406403#comment-17406403 ] Michael McCandless commented on LUCENE-10068: - Yeah that's a good point [~rcmuir] – maybe we should just remove this cache! Especially with the upcoming {{getBulkPath}} it's likely this cache is not helpful anymore. So +1 to first running benchmarks with and without the cache and then maybe nuking it entirely. > Switch to a "double barrel" HPPC cache for the taxonomy LRU cache > - > > Key: LUCENE-10068 > URL: https://issues.apache.org/jira/browse/LUCENE-10068 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 8.8.1 >Reporter: Gautam Worah >Priority: Minor > > While working on an unrelated getBulkPath API > [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came > across a nice optimization that could be made to the taxonomy cache. > The taxonomy cache today caches frequently used ordinals and their > corresponding FacetLabels. It uses the existing LRUHashMap (backed by a > LinkedHashMap) class for its implementation. > This implementation performs suboptimally when it has a large number of > threads accessing it, and it consumes a large amount of RAM. > [~mikemccand] suggested the idea of a two-map ("double barrel") HPPC int->FacetLabel > cache. The basic idea behind the cache is: > # We use two hashmaps, primary and secondary. > # In case of a cache miss in the primary and a cache hit in the secondary, > we add the key to the primary map as well. > # In case of a cache miss in both maps, we add it to the primary map. > # When the primary cache reaches a large number of elements (make this check each time we insert?), > say larger than the existing > {color:#871094}DEFAULT_CACHE_VALUE{color}=4000, we dump the secondary map > and copy all the values of the primary map into it. > The idea was originally explained in > [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] > comment. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
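A minimal sketch of the two-map idea described in the steps above, under the assumption that HPPC's {{IntObjectHashMap}} is the backing structure (the class name, and swapping the maps rather than copying values, are illustrative choices, not the final design):

{code:java}
import com.carrotsearch.hppc.IntObjectHashMap;
import org.apache.lucene.facet.taxonomy.FacetLabel;

// Hypothetical "double barrel" cache: two plain maps instead of a linked LRU structure.
class DoubleBarrelTaxoCache {
  private final int maxSize; // e.g. the existing DEFAULT_CACHE_VALUE = 4000
  private IntObjectHashMap<FacetLabel> primary = new IntObjectHashMap<>();
  private IntObjectHashMap<FacetLabel> secondary = new IntObjectHashMap<>();

  DoubleBarrelTaxoCache(int maxSize) {
    this.maxSize = maxSize;
  }

  synchronized FacetLabel get(int ordinal) {
    FacetLabel label = primary.get(ordinal);
    if (label == null) {
      label = secondary.get(ordinal);
      if (label != null) {
        put(ordinal, label); // secondary hit: promote into the primary map
      }
    }
    return label;
  }

  synchronized void put(int ordinal, FacetLabel label) {
    primary.put(ordinal, label);
    if (primary.size() > maxSize) {
      // primary is full: retire the old secondary and start a fresh primary
      secondary = primary;
      primary = new IntObjectHashMap<>();
    }
  }
}
{code}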
[GitHub] [lucene] mikemccand commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
mikemccand commented on pull request #224: URL: https://github.com/apache/lucene/pull/224#issuecomment-907786967 > > So awesome that `SimpleText` is finally getting a skipping implementation! Thank you @wuda0112! > > Have you tried stressing out the new code by running all Lucene unit-tests with `SimpleText`? Something like `./gradlew -Dtests.codec=SimpleText`. > > @mikemccand Thanks for your reply! I have execute `./gradlew check` `./gradlew test` `./gradlew -p lucene/core/ test` `./gradlew -p lucene/codecs/ test`, and all passed! > But when i execute `./gradlew test -Dtests.codec=SimpleText`, there has one FAILED, the exception is: > > ``` > org.apache.lucene.demo.TestDemo > testKnnVectorSearch FAILED > java.lang.UnsupportedOperationException > at __randomizedtesting.SeedInfo.seed([2DAF064A760C3554:7F0D5D18D85E7028]:0) > at org.apache.lucene.codecs.simpletext.SimpleTextKnnVectorsReader.search(SimpleTextKnnVectorsReader.java:143) > ``` > > when i look into SimpleTextKnnVectorsReader#search, it `throw new UnsupportedOperationException` Ahh it looks like `SimpleText` is missing its KNN implementation but looks like it's coming soon! https://issues.apache.org/jira/browse/LUCENE-10063. In the meantime that failing test should add an `assume` that the current codec is not `SimpleText`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
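For the `assume` guard suggested above, a minimal sketch of what the failing test could call (the helper class is hypothetical; `Codec.getDefault()` and JUnit's `Assume` are the only APIs relied on):

```java
import org.apache.lucene.codecs.Codec;
import org.junit.Assume;

public class SimpleTextKnnGuard {
  // Call at the top of testKnnVectorSearch so the test is skipped, not failed,
  // while SimpleTextKnnVectorsReader#search still throws UnsupportedOperationException.
  static void assumeCodecSupportsKnnSearch() {
    Assume.assumeFalse(
        "SimpleText has no KNN search implementation yet",
        "SimpleText".equals(Codec.getDefault().getName()));
  }
}
```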
[GitHub] [lucene] mikemccand commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments
mikemccand commented on a change in pull request #128: URL: https://github.com/apache/lucene/pull/128#discussion_r698012354 ## File path: lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java ## @@ -321,6 +326,11 @@ public static void syncConcurrentMerges(MergeScheduler ms) { checker.setDoSlowChecks(doSlowChecks); checker.setFailFast(failFast); checker.setInfoStream(new PrintStream(output, false, IOUtils.UTF_8), false); + if (concurrent) { +checker.setThreadCount(RandomizedTest.randomIntBetween(1, 5)); + } else { +checker.setThreadCount(0); Review comment: Great thanks @zacharymorn -- that sounds like a good approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
wuda0112 commented on pull request #224: URL: https://github.com/apache/lucene/pull/224#issuecomment-907810690 > Ahh it looks like `SimpleText` is missing its KNN implementation but looks like it's coming soon! https://issues.apache.org/jira/browse/LUCENE-10063. In the meantime that failing test should add an `assume` that the current codec is not `SimpleText`. Ok, got it, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…
rmuir commented on pull request #271: URL: https://github.com/apache/lucene/pull/271#issuecomment-907829353 Could we look at storing this stuff as docvalues instead of as payloads that are read into heap memory? It is just integers or lists of integers, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
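For illustration, a minimal sketch of the write side of that suggestion, packing a per-document list of ints into a `BinaryDocValuesField` (the field name `"taxo-parents"` and the vInt encoding are assumptions for the example, not the taxonomy index's actual format):

```java
import java.io.IOException;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.store.ByteBuffersDataOutput;
import org.apache.lucene.util.BytesRef;

public class OrdinalsAsDocValues {
  // Encode a small list of ints as one binary doc value so it can be read
  // per document at search time instead of being loaded onto the heap up front.
  static Document docWithOrdinals(int... ordinals) throws IOException {
    ByteBuffersDataOutput out = new ByteBuffersDataOutput();
    out.writeVInt(ordinals.length);
    for (int ord : ordinals) {
      out.writeVInt(ord);
    }
    Document doc = new Document();
    doc.add(new BinaryDocValuesField("taxo-parents", new BytesRef(out.toArrayCopy())));
    return doc;
  }
}
```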
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179: URL: https://github.com/apache/lucene/pull/179#discussion_r698055359 ## File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java ## @@ -89,6 +89,28 @@ private void createNewTaxonomyIndex(String dirName) throws IOException { dir.close(); } + // Opens up a pre-existing index and tries to run getBulkPath on it + public void testGetBulkPathOnOlderCodec() throws Exception { +Path indexDir = createTempDir(oldTaxonomyIndexName); +TestUtil.unzip(getDataInputStream(oldTaxonomyIndexName + ".zip"), indexDir); +Directory dir = newFSDirectory(indexDir); + +DirectoryTaxonomyWriter writer = new DirectoryTaxonomyWriter(dir); +DirectoryTaxonomyReader reader = new DirectoryTaxonomyReader(writer); + +FacetLabel[] facetLabels = { + new FacetLabel("a"), new FacetLabel("b"), +}; +int[] ordinals = +new int[] {reader.getOrdinal(facetLabels[0]), reader.getOrdinal(facetLabels[1])}; +// The zip file already contains a category "a" and a category "b" stored with the older Review comment: We have already added this condition in the `createNewTaxonomyIndex()` test (i.e. the back-compat test we added for the BDV switch). See [here](https://github.com/apache/lucene/blob/56eb76dbaf7b2e9a4fbf7c1a25af2ba81ab1bc92/lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java#L74) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179: URL: https://github.com/apache/lucene/pull/179#discussion_r698056689 ## File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java ## @@ -366,9 +366,9 @@ public void testReaderBasic() throws Exception { } } // (also test invalid ordinals:) -assertNull(tr.getPath(-1)); -assertNull(tr.getPath(tr.getSize())); -assertNull(tr.getPath(TaxonomyReader.INVALID_ORDINAL)); +expectThrows(IllegalArgumentException.class, () -> tr.getPath(-1)); Review comment: Sure. Added it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10077) Closing the DirTaxonomyReader while another thread accesses the cache can throw NPE
Gautam Worah created LUCENE-10077: - Summary: Closing the DirTaxonomyReader while another thread accesses the cache can throw NPE Key: LUCENE-10077 URL: https://issues.apache.org/jira/browse/LUCENE-10077 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Affects Versions: main (9.0) Reporter: Gautam Worah When we close a {{DirectoryTaxonomyReader}} in {{doClose}}, we set the {{categoryCache}} to null. But if another thread accesses the cache after this {{doClose}} call, it will still try to acquire a lock and {{synchronize}} on it, which results in an NPE. This works today because we operate on the assumption that the user will always call {{doClose}} after all threads have completed. One suggestion by [~mikemccand] in this [PR|https://github.com/apache/lucene/pull/179#discussion_r697880516] was to make {{categoryCache}} final and throw an {{AlreadyClosedException}} instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
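A minimal sketch of that suggestion (hypothetical class and method names, not DirectoryTaxonomyReader's actual code): keep the cache reference final and gate access on a closed flag, so a racing caller gets an {{AlreadyClosedException}} rather than an NPE:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.LRUHashMap;
import org.apache.lucene.store.AlreadyClosedException;

// Hypothetical sketch: the cache is never nulled out in close(), so a reader
// racing with close() either succeeds or gets AlreadyClosedException, never an NPE.
class ClosableTaxoCache {
  private final LRUHashMap<Integer, FacetLabel> categoryCache = new LRUHashMap<>(4000);
  private final AtomicBoolean closed = new AtomicBoolean(false);

  FacetLabel getCached(int ordinal) {
    if (closed.get()) {
      throw new AlreadyClosedException("this taxonomy reader is already closed");
    }
    synchronized (categoryCache) {
      return categoryCache.get(ordinal);
    }
  }

  void close() {
    if (closed.compareAndSet(false, true)) {
      synchronized (categoryCache) {
        categoryCache.clear();
      }
    }
  }
}
{code}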
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179: URL: https://github.com/apache/lucene/pull/179#discussion_r698061294 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java ## @@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException { } synchronized (categoryCache) { - categoryCache.put(catIDInteger, ret); + categoryCache.put(ordinal, ret); } return ret; } + private FacetLabel[] getPathFromCache(int... ordinals) { +FacetLabel[] facetLabels = new FacetLabel[ordinals.length]; +// TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap, +// wrapped as LRU? +synchronized (categoryCache) { + for (int i = 0; i < ordinals.length; i++) { +facetLabels[i] = categoryCache.get(ordinals[i]); + } +} +return facetLabels; + } + + /** + * Checks if the ordinals in the array are >=0 and < {@code + * DirectoryTaxonomyReader#indexReader.maxDoc()} + * + * @param ordinals Integer array of ordinals + * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is + * out of bounds + */ + private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException { +for (int ordinal : ordinals) { + if (ordinal < 0 || ordinal >= indexReader.maxDoc()) { +throw new IllegalArgumentException( +"ordinal " ++ ordinal ++ " is out of the range of the indexReader " ++ indexReader.toString() ++ ". The maximum possible ordinal number is " ++ (indexReader.maxDoc() - 1)); + } +} + } + + /** + * Returns an array of FacetLabels for a given array of ordinals. + * + * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of + * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index + * was created using StoredFields (with no performance gains) and uses DocValues based iteration + * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0 + * + * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy + * index + */ + @Override + public FacetLabel[] getBulkPath(int... ordinals) throws IOException { +ensureOpen(); +checkOrdinalBounds(ordinals); + +int ordinalsLength = ordinals.length; +FacetLabel[] bulkPath = new FacetLabel[ordinalsLength]; +// remember the original positions of ordinals before they are sorted +int[] originalPosition = new int[ordinalsLength]; +Arrays.setAll(originalPosition, IntUnaryOperator.identity()); + +getPathFromCache(ordinals); Review comment: Ahh. This can indeed happen. Opened https://issues.apache.org/jira/browse/LUCENE-10077 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179: URL: https://github.com/apache/lucene/pull/179#discussion_r698080163 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java ## @@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException { } synchronized (categoryCache) { - categoryCache.put(catIDInteger, ret); + categoryCache.put(ordinal, ret); } return ret; } + private FacetLabel[] getPathFromCache(int... ordinals) { +FacetLabel[] facetLabels = new FacetLabel[ordinals.length]; +// TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap, +// wrapped as LRU? +synchronized (categoryCache) { + for (int i = 0; i < ordinals.length; i++) { +facetLabels[i] = categoryCache.get(ordinals[i]); + } +} +return facetLabels; + } + + /** + * Checks if the ordinals in the array are >=0 and < {@code + * DirectoryTaxonomyReader#indexReader.maxDoc()} + * + * @param ordinals Integer array of ordinals + * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is + * out of bounds + */ + private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException { +for (int ordinal : ordinals) { + if (ordinal < 0 || ordinal >= indexReader.maxDoc()) { +throw new IllegalArgumentException( +"ordinal " ++ ordinal ++ " is out of the range of the indexReader " ++ indexReader.toString() ++ ". The maximum possible ordinal number is " ++ (indexReader.maxDoc() - 1)); + } +} + } + + /** + * Returns an array of FacetLabels for a given array of ordinals. + * + * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of + * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index + * was created using StoredFields (with no performance gains) and uses DocValues based iteration + * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0 + * + * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy + * index + */ + @Override + public FacetLabel[] getBulkPath(int... ordinals) throws IOException { +ensureOpen(); +checkOrdinalBounds(ordinals); + +int ordinalsLength = ordinals.length; +FacetLabel[] bulkPath = new FacetLabel[ordinalsLength]; +// remember the original positions of ordinals before they are sorted +int[] originalPosition = new int[ordinalsLength]; +Arrays.setAll(originalPosition, IntUnaryOperator.identity()); + +getPathFromCache(ordinals); + +/* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */ +new InPlaceMergeSorter() { + @Override + protected void swap(int i, int j) { +int x = ordinals[i]; +ordinals[i] = ordinals[j]; +ordinals[j] = x; + +x = originalPosition[i]; +originalPosition[i] = originalPosition[j]; +originalPosition[j] = x; + } + + @Override + public int compare(int i, int j) { +return Integer.compare(ordinals[i], ordinals[j]); + } +}.sort(0, ordinalsLength); + +int readerIndex; +int leafReaderMaxDoc = 0; +int leafReaderDocBase = 0; +LeafReader leafReader; +LeafReaderContext leafReaderContext; +BinaryDocValues values = null; +List uncachedOrdinalPositions = new ArrayList<>(); + +for (int i = 0; i < ordinalsLength; i++) { + if (bulkPath[originalPosition[i]] == null) { +/* +If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal. 
+Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc +(which is the maxDoc of the specific leaf) + */ +if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) { + + readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves()); + leafReaderContext = indexReader.leaves().get(readerIndex); + leafReader = leafReaderContext.reader(); + leafReaderMaxDoc = leafReader.maxDoc(); + leafReaderDocBase = leafReaderContext.docBase; + values = leafReader.getBinaryDocValues(Consts.FULL); + + /* + If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null + */ + if (values == null) { Review comment: Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged :D -- This is an automated message from the
[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments
zacharymorn commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-907952441 > Thanks @zacharymorn -- the changes look awesome! I'm looking forward to faster `CheckIndex`! Thanks @mikemccand again for the review and approval, and I look forward to the speedup as well! I'll merge in a few days in case anyone else has further comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace
[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406501#comment-17406501 ] Anh Dung Bui commented on LUCENE-10059: --- Thanks [~mikemccand], we already backported to Lucene 8.x. I'll resolve this and create follow-up JIRA. > Assertion error in JapaneseTokenizer backtrace > -- > > Key: LUCENE-10059 > URL: https://issues.apache.org/jira/browse/LUCENE-10059 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.8 >Reporter: Anh Dung Bui >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > There is a rare case which causes an AssertionError in the backtrace step of > JapaneseTokenizer that we (Amazon Product Search) found in our tests. > If there is a text span of length 1024 (determined by > [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116]) > where the regular backtrace is not called, a [forced > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781] > will be applied. If the partially best path at this point happens to end at > the last pos, and since there is always a [final > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044] > applied at the end, the final backtrace will try to backtrace from and to > the same position, causing an AssertionError in RollingCharBuffer.get() when > it tries to generate an empty buffer. > We are fixing it by returning prematurely in the backtrace() method when the > from and to pos are the same: > {code:java} > if (endPos == lastBackTracePos) { > return; > } > {code} > The backtrace() method is essentially no-op when this condition happens, thus > when _-ea_ is not enabled, it can still output the correct tokens. > We will open a PR for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace
[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anh Dung Bui updated LUCENE-10059: -- Fix Version/s: main (9.0) 8.x > Assertion error in JapaneseTokenizer backtrace > -- > > Key: LUCENE-10059 > URL: https://issues.apache.org/jira/browse/LUCENE-10059 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.8 >Reporter: Anh Dung Bui >Priority: Major > Fix For: 8.x, main (9.0) > > Time Spent: 1h 20m > Remaining Estimate: 0h > > There is a rare case which causes an AssertionError in the backtrace step of > JapaneseTokenizer that we (Amazon Product Search) found in our tests. > If there is a text span of length 1024 (determined by > [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116]) > where the regular backtrace is not called, a [forced > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781] > will be applied. If the partially best path at this point happens to end at > the last pos, and since there is always a [final > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044] > applied at the end, the final backtrace will try to backtrace from and to > the same position, causing an AssertionError in RollingCharBuffer.get() when > it tries to generate an empty buffer. > We are fixing it by returning prematurely in the backtrace() method when the > from and to pos are the same: > {code:java} > if (endPos == lastBackTracePos) { > return; > } > {code} > The backtrace() method is essentially no-op when this condition happens, thus > when _-ea_ is not enabled, it can still output the correct tokens. > We will open a PR for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace
[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anh Dung Bui resolved LUCENE-10059. --- Resolution: Fixed > Assertion error in JapaneseTokenizer backtrace > -- > > Key: LUCENE-10059 > URL: https://issues.apache.org/jira/browse/LUCENE-10059 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.8 >Reporter: Anh Dung Bui >Priority: Major > Fix For: 8.x, main (9.0) > > Time Spent: 1h 20m > Remaining Estimate: 0h > > There is a rare case which causes an AssertionError in the backtrace step of > JapaneseTokenizer that we (Amazon Product Search) found in our tests. > If there is a text span of length 1024 (determined by > [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116]) > where the regular backtrace is not called, a [forced > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781] > will be applied. If the partially best path at this point happens to end at > the last pos, and since there is always a [final > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044] > applied at the end, the final backtrace will try to backtrace from and to > the same position, causing an AssertionError in RollingCharBuffer.get() when > it tries to generate an empty buffer. > We are fixing it by returning prematurely in the backtrace() method when the > from and to pos are the same: > {code:java} > if (endPos == lastBackTracePos) { > return; > } > {code} > The backtrace() method is essentially no-op when this condition happens, thus > when _-ea_ is not enabled, it can still output the correct tokens. > We will open a PR for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8723) Bad interaction between WordDelimiterGraphFilter, StopFilter and FlattenGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406516#comment-17406516 ] Geoffrey Lawson commented on LUCENE-8723: - I believe this issue has been fixed in LUCENE-9963. The test {{testAltPathLastStepHoleWithoutEndToken}} in {{TestFlattenGraphFilter}} is based on this problem. There were some issues with {{FlattenGraphFilter}} not correctly handling deleted tokens. Both test cases in this issue appear to be working now. > Bad interaction between WordDelimiterGraphFilter, StopFilter and > FlattenGraphFilter > --- > > Key: LUCENE-8723 > URL: https://issues.apache.org/jira/browse/LUCENE-8723 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 7.7.1, 8.0, 8.3 >Reporter: Nicolás Lichtmaier >Priority: Major > > I was debugging an issue (missing tokens after analysis) and when I enabled > Java assertions I uncovered a bug when using WordDelimiterGraphFilter + > StopFilter + FlattenGraphFilter. > I could reproduce the issue in a small piece of code. This code gives an > assertion failure when assertions are enabled (-ea java option): > {code:java} > Builder builder = CustomAnalyzer.builder(); > builder.withTokenizer(StandardTokenizerFactory.class); > builder.addTokenFilter(WordDelimiterGraphFilterFactory.class, > "preserveOriginal", "1"); > builder.addTokenFilter(StopFilterFactory.class); > builder.addTokenFilter(FlattenGraphFilterFactory.class); > Analyzer analyzer = builder.build(); > > TokenStream ts = analyzer.tokenStream("*", new StringReader("x7in")); > ts.reset(); > while(ts.incrementToken()) > ; > {code} > This gives: > {code} > Exception in thread "main" java.lang.AssertionError: 2 > at > org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195) > at > org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258) > at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32) > {code} > Maybe removing stop words after WordDelimiterGraphFilter is wrong, I don't > know. However, it is the only way to process stop-words generated by that filter. > In any case, it should not eat tokens or trigger assertion errors. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org