[jira] [Updated] (LUCENE-10233) Store docIds as bitset when doc IDs are strictly sorted and dense

2022-04-06 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10233:
--
Summary: Store docIds as bitset when doc IDs are strictly sorted and dense  
(was: Store docIds as bitset when leafCardinality = 1 to speed up addAll)

> Store docIds as bitset when doc IDs are strictly sorted and dense
> -
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
> Fix For: 9.1
>
> Attachments: SparseFixedBitSet.png
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> For low-cardinality point fields, a leaf block's doc IDs usually all share 
> the same point value, so {{intersect}} ends up in the {{addAll}} logic. 
> If we store the IDs as a bitset and give the IntersectVisitor a bulk-visiting 
> ability, we can speed up addAll, because we can simply 'or' the block's IDs 
> into the result.
> The optimization is triggered when both of the following conditions are met:
>  # doc IDs are strictly sorted
>  # max(docId) - min(docId) <= 16 * pointCount (to avoid inflating storage 
> too much)
> I mocked a field with 10,000,000 docs per value and searched it with a 
> single-term PointInSetQuery; the scorer build time decreased from 151ms to 5ms.
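
The two conditions and the 'or' step described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from the patch: the class and method names (`DenseDocIdBlock`, `canStoreAsBitSet`, `alignedBase`, `encode`, `orInto`, `contains`) are hypothetical, and a bare `long[]` stands in for Lucene's own bitset classes.

```java
/** Hypothetical sketch: store a dense, strictly sorted doc ID block as a bitset. */
final class DenseDocIdBlock {

  /** True if IDs are strictly increasing and max - min <= 16 * count. */
  static boolean canStoreAsBitSet(int[] docIds) {
    if (docIds.length == 0) {
      return false;
    }
    for (int i = 1; i < docIds.length; i++) {
      if (docIds[i] <= docIds[i - 1]) {
        return false; // not strictly sorted
      }
    }
    long range = (long) docIds[docIds.length - 1] - docIds[0];
    return range <= 16L * docIds.length;
  }

  /** Base of the block, rounded down to a multiple of 64 so words line up with the result. */
  static int alignedBase(int[] docIds) {
    return docIds[0] & ~63;
  }

  /** Encodes the block as 64-bit words relative to alignedBase(docIds). */
  static long[] encode(int[] docIds) {
    int base = alignedBase(docIds);
    int numWords = ((docIds[docIds.length - 1] - base) >>> 6) + 1;
    long[] words = new long[numWords];
    for (int id : docIds) {
      int offset = id - base;
      words[offset >>> 6] |= 1L << offset; // long shifts use only the low 6 bits
    }
    return words;
  }

  /** Bulk add: ORs the block's words into the result bitset, one word at a time. */
  static void orInto(long[] result, long[] block, int base) {
    int firstWord = base >>> 6;
    for (int i = 0; i < block.length; i++) {
      result[firstWord + i] |= block[i];
    }
  }

  /** Tests whether a doc ID is set in the result bitset. */
  static boolean contains(long[] bits, int doc) {
    return (bits[doc >>> 6] & (1L << doc)) != 0;
  }
}
```

With this layout, adding a whole block costs one OR per 64 doc IDs of range instead of one visit per doc ID, which is the kind of saving the description attributes to bulk visiting.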



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10500) StringValueFacetCounts relies on sequential collection

2022-04-06 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10500.
---
Fix Version/s: 9.2
   Resolution: Fixed

> StringValueFacetCounts relies on sequential collection
> --
>
> Key: LUCENE-10500
> URL: https://issues.apache.org/jira/browse/LUCENE-10500
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Luca Cavanna
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We recently moved some of the facets tests to use IndexSearcher#search(Query, 
> CollectorManager) providing a FacetsCollectorManager instead of a 
> FacetsCollector. Whenever newIndexSearcher(IndexReader) is used in tests, 
> concurrent search may now be exercised while it was not before.
> This caused some build failures on TestStringValueFacetCounts:
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>   at __randomizedtesting.SeedInfo.seed([ED8BF8281FCE5C02:9FC7DD27AEAEEA71]:0)
>   at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.packed.Packed64.get(Packed64.java:81)
>   at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.OrdinalMap$2.get(OrdinalMap.java:346)
>   at org.apache.lucene.facet.StringValueFacetCounts.countOneSegment(StringValueFacetCounts.java:440)
>   at org.apache.lucene.facet.StringValueFacetCounts.count(StringValueFacetCounts.java:295)
>   at org.apache.lucene.facet.StringValueFacetCounts.<init>(StringValueFacetCounts.java:123)
>   at org.apache.lucene.facet.TestStringValueFacetCounts.checkFacetResult(TestStringValueFacetCounts.java:349)
>   at org.apache.lucene.facet.TestStringValueFacetCounts.testRandom(TestStringValueFacetCounts.java:325)
> {code}
> This looks like a real bug: StringValueFacetCounts#countOneSegment is 
> called with the index of the current loop iteration instead of the ordinal 
> taken from the matching hits being analyzed. That works fine with 
> single-threaded collection, because we go through the segments sequentially 
> and the two indices are always the same. With multi-threaded search, the 
> order of the returned matching hits (one per segment) is not deterministic.
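
The failure mode described above can be reproduced in a few lines. The following is a simplified sketch rather than Lucene code: `SegmentHits`, `countByLoopIndex`, and `countBySegmentOrd` are hypothetical names, and a per-segment hit count stands in for the real per-segment ordinal lookups.

```java
import java.util.List;

/** Hypothetical sketch of indexing per-segment state by loop index vs. by segment ordinal. */
final class SegmentOrdinalDemo {

  /** Stand-in for one segment's matching hits; segmentOrd identifies the segment. */
  static final class SegmentHits {
    final int segmentOrd;
    final int hitCount;

    SegmentHits(int segmentOrd, int hitCount) {
      this.segmentOrd = segmentOrd;
      this.hitCount = hitCount;
    }
  }

  /** Buggy variant: assumes the i-th entry in the list belongs to segment i. */
  static int[] countByLoopIndex(List<SegmentHits> hits, int numSegments) {
    int[] counts = new int[numSegments];
    for (int i = 0; i < hits.size(); i++) {
      counts[i] += hits.get(i).hitCount;
    }
    return counts;
  }

  /** Fixed variant: uses the ordinal carried by the hits themselves. */
  static int[] countBySegmentOrd(List<SegmentHits> hits, int numSegments) {
    int[] counts = new int[numSegments];
    for (SegmentHits h : hits) {
      counts[h.segmentOrd] += h.hitCount;
    }
    return counts;
  }
}
```

When the list arrives in segment order both variants agree, which is why the problem stays hidden with sequential collection. Once the entries arrive in any other order, only the ordinal-based variant attributes counts to the right segment.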






[jira] [Created] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-06 Thread Lu Xugang (Jira)
Lu Xugang created LUCENE-10502:
--

 Summary: Use IndexedDISI to store docIds and 
DirectMonotonicWriter/Reader to handle ordToDoc 
 Key: LUCENE-10502
 URL: https://issues.apache.org/jira/browse/LUCENE-10502
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 9.1
Reporter: Lu Xugang


Now 






[jira] [Updated] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10502:
---
Description: Since at search phase, vector's all docs of all fields will be 
fully loaded into memory, could we use IndexedDISI to store docIds and 
DirectMonotonicWriter/Reader to handle ordToDoc?  (was: Since at search phase, 
vector's all docs of all fields will be fully loaded into memory, could we )

> Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle 
> ordToDoc 
> 
>
> Key: LUCENE-10502
> URL: https://issues.apache.org/jira/browse/LUCENE-10502
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.1
>Reporter: Lu Xugang
>Priority: Major
>
> Since at search phase, vector's all docs of all fields will be fully loaded 
> into memory, could we use IndexedDISI to store docIds and 
> DirectMonotonicWriter/Reader to handle ordToDoc?






[jira] [Updated] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10502:
---
Description: Since at search phase, vector's all docs of all fields will be 
fully loaded into memory, could we   (was: Now )

> Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle 
> ordToDoc 
> 
>
> Key: LUCENE-10502
> URL: https://issues.apache.org/jira/browse/LUCENE-10502
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.1
>Reporter: Lu Xugang
>Priority: Major
>
> Since at search phase, vector's all docs of all fields will be fully loaded 
> into memory, could we 






[jira] [Updated] (LUCENE-10502) Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10502:
---
Description: Since at search phase, vector's all docs of all fields will be 
fully loaded into memory, could we use IndexedDISI to store docIds and 
DirectMonotonicWriter/Reader to handle ordToDoc mapping?  (was: Since at search 
phase, vector's all docs of all fields will be fully loaded into memory, could 
we use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle 
ordToDoc?)

> Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle 
> ordToDoc 
> 
>
> Key: LUCENE-10502
> URL: https://issues.apache.org/jira/browse/LUCENE-10502
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.1
>Reporter: Lu Xugang
>Priority: Major
>
> Since the doc IDs of every vector field are fully loaded into memory at 
> search time, could we use IndexedDISI to store docIds and 
> DirectMonotonicWriter/Reader to handle the ordToDoc mapping?






[GitHub] [lucene] LuXugang opened a new pull request, #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-06 Thread GitBox


LuXugang opened a new pull request, #792:
URL: https://github.com/apache/lucene/pull/792

   Since the doc IDs of every vector field are fully loaded into memory at 
search time, could we use IndexedDISI to store docIds and 
DirectMonotonicWriter/Reader to handle the ordToDoc mapping?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[GitHub] [lucene] javanna commented on pull request #788: LUCENE-10500: StringValueFacetCounts to not rely on sequential collection

2022-04-06 Thread GitBox


javanna commented on PR #788:
URL: https://github.com/apache/lucene/pull/788#issuecomment-1089977302

   oh well if I had to apologize for every bug I committed... happy to help! 
Also good to see that using collector managers in tests helped uncover this.





[GitHub] [lucene] mocobeta opened a new pull request, #793: LUCENE-10493: add 'backWordPos' array to JapaneseTokenizer.Position

2022-04-06 Thread GitBox


mocobeta opened a new pull request, #793:
URL: https://github.com/apache/lucene/pull/793

   `JapaneseTokenizer.Position` and `KoreanTokenizer.Position` are almost the 
same except for the `backWordPos` array, which only exists in KoreanTokenizer. To 
factor out the Viterbi algorithm, the two `Position` classes have to be made 
identical, at least for the moment.
   I'm sorry that this adds an extra int array to JapaneseTokenizer (kuromoji), 
but I think the integration is worth that cost, and optimization can come later.





[jira] [Created] (LUCENE-10503) Preserve more significant bits of scores in WANDScorer

2022-04-06 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10503:
-

 Summary: Preserve more significant bits of scores in WANDScorer
 Key: LUCENE-10503
 URL: https://issues.apache.org/jira/browse/LUCENE-10503
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


WANDScorer operates on longs to avoid accuracy issues with floating-point 
numbers. The current scaling loses more bits of accuracy than it needs to, and 
preserving more of them could help skip in a few more situations.
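
As background for why integer arithmetic is attractive here, the sketch below shows that float sums depend on the order of the additions while sums of scaled longs do not. It is not WANDScorer's actual code: `scaleUp` is a hypothetical helper, and both the power-of-two scaling and the rounding up (chosen so that a scaled upper bound never underestimates the original score) are assumptions made for illustration.

```java
/** Hypothetical sketch: order-dependent float sums vs. order-independent scaled long sums. */
final class ScaledScoreDemo {

  /** Scales a non-negative score by 2^scalingFactor and rounds up to a long (assumed scheme). */
  static long scaleUp(float score, int scalingFactor) {
    return (long) Math.ceil(Math.scalb((double) score, scalingFactor));
  }

  /** Sums floats left to right; the result can change when the inputs are reordered. */
  static float floatSum(float... scores) {
    float sum = 0f;
    for (float s : scores) {
      sum += s;
    }
    return sum;
  }

  /** Sums scaled longs left to right; integer addition is exact, so order is irrelevant. */
  static long scaledSum(int scalingFactor, float... scores) {
    long sum = 0;
    for (float s : scores) {
      sum += scaleUp(s, scalingFactor);
    }
    return sum;
  }
}
```

For example, `floatSum(1e8f, 4f, 4f)` and `floatSum(4f, 4f, 1e8f)` differ, because adjacent floats near 1e8 are 8 apart and each lone `4f` is rounded away, whereas `scaledSum` returns the same value for both orderings. The more bits the scaling preserves, the closer the long bounds stay to the original float scores.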






[GitHub] [lucene] jpountz opened a new pull request, #794: LUCENE-10153: Improve accuracy of scaled scores in WANDScorer.

2022-04-06 Thread GitBox


jpountz opened a new pull request, #794:
URL: https://github.com/apache/lucene/pull/794

   See https://issues.apache.org/jira/browse/LUCENE-10503.





[GitHub] [lucene] jpountz merged pull request #785: LUCENE-10002: move MemoryIndex to search(Query, CollectorManager)

2022-04-06 Thread GitBox


jpountz merged PR #785:
URL: https://github.com/apache/lucene/pull/785





[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-04-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517958#comment-17517958
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit 74e9716aec74e862b3073e01d3ccbccb199b41e0 in lucene's branch 
refs/heads/main from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=74e9716aec7 ]

LUCENE-10002: move MemoryIndex to search(Query, CollectorManager) (#785)



> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?
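
The pattern the issue points users to can be sketched without Lucene on the classpath. The interfaces below are simplified stand-ins, not Lucene's `Collector` or `CollectorManager`; `CollectorManagerSketch`, `CountingCollector`, `CountManager`, and the slice-based `search` method are hypothetical names used only to show the shape of the idea: one collector per slice, each used by a single thread, with results merged in `reduce`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical sketch of the collector-manager pattern with simplified interfaces. */
final class CollectorManagerSketch {

  interface Collector {
    void collect(int doc);
  }

  interface Manager<C extends Collector, T> {
    C newCollector();

    T reduce(Collection<C> collectors);
  }

  /** Per-slice collector; no synchronization needed because one thread owns it. */
  static final class CountingCollector implements Collector {
    int count;

    @Override
    public void collect(int doc) {
      count++;
    }
  }

  static final class CountManager implements Manager<CountingCollector, Integer> {
    @Override
    public CountingCollector newCollector() {
      return new CountingCollector();
    }

    @Override
    public Integer reduce(Collection<CountingCollector> collectors) {
      int total = 0;
      for (CountingCollector c : collectors) {
        total += c.count;
      }
      return total;
    }
  }

  /** Runs each slice on the executor with its own collector, then merges the results. */
  static <C extends Collector, T> T search(
      List<int[]> slices, Manager<C, T> manager, ExecutorService executor) {
    List<C> collectors = new ArrayList<>();
    List<Future<?>> futures = new ArrayList<>();
    for (int[] slice : slices) {
      C collector = manager.newCollector();
      collectors.add(collector);
      futures.add(executor.submit(() -> {
        for (int doc : slice) {
          collector.collect(doc);
        }
      }));
    }
    for (Future<?> f : futures) {
      try {
        f.get(); // wait for the slice; also makes the collector's state visible here
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
      }
    }
    return manager.reduce(collectors);
  }
}
```

A single shared collector cannot be used this way without its own synchronization, which is one reading of why a `Collector`-based entry point has to stay on the caller thread even when the searcher has an executor.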






[GitHub] [lucene] jpountz merged pull request #787: LUCENE-10002: replace more usages of search(Query, Collector) in tests

2022-04-06 Thread GitBox


jpountz merged PR #787:
URL: https://github.com/apache/lucene/pull/787





[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-04-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517962#comment-17517962
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit 1cf1b301af050c9aaedec6bfcbaaebafa6fa3241 in lucene's branch 
refs/heads/main from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1cf1b301af0 ]

LUCENE-10002: replace more usages of search(Query, Collector) in tests (#787)

This commit replaces more usages of search(Query, Collector) with calls to the 
corresponding search(Query, CollectorManager). This round focuses on 
tests that implement a custom collector and therefore need a corresponding 
collector manager.

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 17h
>  Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-04-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517976#comment-17517976
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit 37434ffb1fcaf5e7a9096b13204fd640a9c8113e in lucene's branch 
refs/heads/branch_9x from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=37434ffb1fc ]

LUCENE-10002: move MemoryIndex to search(Query, CollectorManager) (#785)



> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 17h
>  Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-04-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517977#comment-17517977
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit ccd21fa5d9df7f2a30cd81784f49b3f08116c300 in lucene's branch 
refs/heads/branch_9x from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ccd21fa5d9d ]

LUCENE-10002: replace more usages of search(Query, Collector) in tests (#787)

This commit replaces more usages of search(Query, Collector) with calls to the 
corresponding search(Query, CollectorManager). This round focuses on 
tests that implement a custom collector and therefore need a corresponding 
collector manager.

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 17h
>  Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517988#comment-17517988
 ] 

Tomoko Uchida commented on LUCENE-10493:


I'm starting this with small steps. I'll try to keep the commits 
self-contained, and also as small as possible for safety.
https://github.com/apache/lucene/pull/793

Let me know if there is any feedback, thanks! 

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We now have common dictionary interfaces for kuromoji and nori 
> ([LUCENE-10393]). A natural question would be: is it possible to unify the 
> Japanese/Korean tokenizers? 
> The core methods of the two tokenizers are `parse()` and `backtrace()`, which 
> calculate the minimum cost path by Viterbi search. I'd set the goal of this 
> issue to factoring them out into a separate class (in analysis-common) that 
> is shared between JapaneseTokenizer and KoreanTokenizer. 
> The algorithm for solving the minimum cost path is of course 
> language-agnostic, so I think it should be theoretically possible; the most 
> difficult part here might be the N-best path calculation - which is supported 
> only by JapaneseTokenizer and not by KoreanTokenizer.
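
For readers unfamiliar with the technique, here is a minimal sketch of a minimum-cost path search over a word lattice, with a forward pass and a backtrace. It is not the tokenizers' code: `ViterbiSketch`, `segment`, and the `wordCosts` map are hypothetical, and the sketch leaves out connection costs between adjacent tokens, unknown-word handling, and N-best paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: minimum-cost segmentation by dynamic programming over positions. */
final class ViterbiSketch {

  /** Returns the cheapest segmentation of text, or an empty list if none exists. */
  static List<String> segment(String text, Map<String, Integer> wordCosts) {
    int n = text.length();
    long[] bestCost = new long[n + 1]; // cheapest known cost to reach each position
    int[] backPos = new int[n + 1];    // start position of the token ending at each position
    Arrays.fill(bestCost, Long.MAX_VALUE);
    bestCost[0] = 0;

    // Forward pass: extend every reachable position with every dictionary word.
    for (int start = 0; start < n; start++) {
      if (bestCost[start] == Long.MAX_VALUE) {
        continue; // position not reachable
      }
      for (int end = start + 1; end <= n; end++) {
        Integer cost = wordCosts.get(text.substring(start, end));
        if (cost == null) {
          continue;
        }
        long candidate = bestCost[start] + cost;
        if (candidate < bestCost[end]) {
          bestCost[end] = candidate;
          backPos[end] = start;
        }
      }
    }
    if (bestCost[n] == Long.MAX_VALUE) {
      return Collections.emptyList();
    }

    // Backtrace: follow the back pointers from the end and reverse the tokens.
    List<String> tokens = new ArrayList<>();
    for (int pos = n; pos > 0; pos = backPos[pos]) {
      tokens.add(text.substring(backPos[pos], pos));
    }
    Collections.reverse(tokens);
    return tokens;
  }
}
```

Nothing in the two passes depends on the language of the dictionary, which is the observation behind sharing them between the two tokenizers.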






[GitHub] [lucene] mocobeta opened a new pull request, #795: LUCENE-10493: Unify TokenInfoFST in kuromoji and nori

2022-04-06 Thread GitBox


mocobeta opened a new pull request, #795:
URL: https://github.com/apache/lucene/pull/795

   `org.apache.lucene.analysis.[ja|ko].dict.TokenInfoFST` are exactly the same 
except for the range of cached FST root arcs; we can safely unify the cache 
logic and I need this for LUCENE-10493.





[jira] [Comment Edited] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517988#comment-17517988
 ] 

Tomoko Uchida edited comment on LUCENE-10493 at 4/6/22 11:42 AM:
-

I'm starting this with small steps. I'll try to keep the commits 
self-contained, and also as small as possible for safety.
https://github.com/apache/lucene/pull/793
https://github.com/apache/lucene/pull/795

Let me know if there is any feedback, thanks! 


was (Author: tomoko uchida):
I'm starting this with small steps. I'll try to keep the commits 
self-contained, and also as small as possible for safety.
https://github.com/apache/lucene/pull/793

Let me know if there is any feedback, thanks! 

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h



[GitHub] [lucene] wjp719 commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size

2022-04-06 Thread GitBox


wjp719 commented on PR #786:
URL: https://github.com/apache/lucene/pull/786#issuecomment-1090235148

   @jpountz Hi, could you take some time to review this PR? Thanks a lot.





[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518271#comment-17518271
 ] 

Tomoko Uchida commented on LUCENE-10493:


I'm trying to factor out the core algorithm from the Japanese/Korean tokenizers 
with the above modifications. It is still a very rough patch, but it seems 
to work. 
I'd merge #793 and #795 after waiting one or two days and then prepare the 
main PR. The next step can't be small, since it has to show the full picture 
(creating a base `Viterbi` class in analysis-common, moving the common logic 
to it, and rewriting the Japanese/Korean tokenizers on top of it), but I will 
try to sort out the interfaces for review.

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-04-06 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518272#comment-17518272
 ] 

Feng Guo commented on LUCENE-10315:
---

Thanks [~ivera], [~jpountz] for all effort and suggestions here! 

FYI, here is something interesting: I tried to change
{code:java}
@Benchmark
public void readInts24ForUtilVisitor(IntDecodeState state, Blackhole bh) {
  decode24(state);
  // Reads the state.outputInts field on every iteration.
  for (int i = 0; i < state.count; i++) {
    bh.consume(state.outputInts[i]);
  }
}
{code}
To
{code:java}
@Benchmark
public void readInts24ForUtilVisitorImproved(IntDecodeState state, Blackhole bh) {
  decode24(state);
  // Copy the array reference into a local variable before the loop.
  int[] ints = state.outputInts;
  for (int i = 0; i < state.count; i++) {
    bh.consume(ints[i]);
  }
}
{code}
And here is the result:
{code:java}
Benchmark                                             Mode   Cnt  Score   Error   Units
ReadInts24Benchmark.readInts24ForUtilVisitor          thrpt   10  0.776 ± 0.012  ops/us
ReadInts24Benchmark.readInts24ForUtilVisitorImproved  thrpt   10  0.848 ± 0.012  ops/us
ReadInts24Benchmark.readInts24Visitor                 thrpt   10  0.786 ± 0.006  ops/us

$ java -version
openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
{code}

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Attachments: addall.svg, cpu_profile_baseline.html, 
> cpu_profile_path.html
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for 
> users with its dynamic mapping feature. When users index low-cardinality 
> fields, such as gender / age / status, they often use numbers to 
> represent the values, so ES infers these fields as {{long}} and 
> uses BKD as the index for {{long}} fields. As the data volume grows, 
> building the result set for low-cardinality fields makes CPU usage and 
> load very high.
> This is a flame graph we obtained from the production environment:
> [^addall.svg]
> It shows that almost all CPU time is spent in addAll. When we reindexed 
> {{long}} to {{keyword}}, the cluster load and search latency were greatly 
> reduced (we spent weeks reindexing all indices...). I know that the ES 
> documentation recommends {{keyword}} for term/terms queries and {{long}} for 
> range queries, but there are always some users who don't realize 
> this and keep their habits from SQL databases, or for whom dynamic mapping 
> selects the type automatically. All in all, users won't realize that 
> there is such a big performance difference between {{long}} and 
> {{keyword}} for low-cardinality fields. So from my point of view, it 
> makes sense to make BKD work better for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields, {{keyword}} has two 
> advantages over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than 
> BKD's delta VInt, because of its batch reading (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily 
> materialize its result set, so when another clause with a small result set 
> intersects with this low-cardinality condition, the low-cardinality field can 
> avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to 
> use a 512-int {{ForUtil}} for the BKD ids codec. I benchmarked this 
> optimization by mocking some random {{LongPoint}} values and querying them 
> with {{PointInSetQuery}}.
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1

[GitHub] [lucene] zhaih commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-06 Thread GitBox


zhaih commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r844154121


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestAlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import org.apache.lucene.facet.FacetTestCase;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.util.IOUtils;
+
+public class TestAlwaysRefreshDirectoryTaxonomyReader extends FacetTestCase {
+
+  /**
+   * Tests the behavior of the {@link AlwaysRefreshDirectoryTaxonomyReader} by testing if the
+   * associated {@link SearcherTaxonomyManager} can successfully refresh and serve queries if the
+   * underlying taxonomy index is changed to an older checkpoint. Ideally, each checkpoint should be
+   * self-sufficient and should allow serving search queries when {@link
+   * SearcherTaxonomyManager#maybeRefresh()} is called.
+   *
+   * It does not check whether the private taxoArrays were actually recreated or no. We are
+   * (correctly) hiding away that complexity away from the user.
+   */
+  public void testAlwaysRefreshDirectoryTaxonomyReader() throws IOException {
+final Path taxoPath1 = createTempDir("dir1");
+final Directory dir1 = newFSDirectory(taxoPath1);
+final DirectoryTaxonomyWriter tw1 =
+new DirectoryTaxonomyWriter(dir1, IndexWriterConfig.OpenMode.CREATE);
+tw1.addCategory(new FacetLabel("a"));
+tw1.commit(); // commit1
+
+final Path taxoPath2 = createTempDir("commit1");
+final Directory commit1 = newFSDirectory(taxoPath2);
+// copy all index files from dir1
+for (String file : dir1.listAll()) {
+  commit1.copyFrom(dir1, file, file, IOContext.READ);
+}
+
+tw1.addCategory(new FacetLabel("b"));
+tw1.commit(); // commit2
+tw1.close();
+
+final DirectoryReader dr1 = DirectoryReader.open(dir1);
+// using a DirectoryTaxonomyReader here will cause the test to fail and throw a AIOOB exception

Review Comment:
   I guess I would write a generic function like
   ```
   void  testCase(Function dtrProducer, Class exceptionType)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G commented on a diff in pull request #778: LUCENE-10495: Fix bug in TaxonomyFacets

2022-04-06 Thread GitBox


Yuti-G commented on code in PR #778:
URL: https://github.com/apache/lucene/pull/778#discussion_r844156099


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java:
##
@@ -109,7 +109,7 @@ public boolean childrenLoaded() {
* @lucene.experimental
*/
   public boolean siblingsLoaded() {
-return children != null;
+return siblings != null;

Review Comment:
   Hi @gsmiller, thanks for your feedback! There is only one use case that I 
could find where `siblingsLoaded()` and `childrenLoaded()` can return different 
boolean values, and I added a test for it. Please let me know if you have any 
questions. Thanks again!



[jira] [Updated] (LUCENE-10495) Fix bug in TaxonomyFacets

2022-04-06 Thread Yuting Gan (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuting Gan updated LUCENE-10495:

Description: 
Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. 
siblingsLoaded() should return siblings != null, but it currently returns 
children != null. 

 

 

  was:
Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. 
siblingsLoaded() should return siblings != null;

 

 


> Fix bug in TaxonomyFacets
> -
>
> Key: LUCENE-10495
> URL: https://issues.apache.org/jira/browse/LUCENE-10495
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Yuting Gan
>Priority: Minor
> Attachments: Screen Shot 2022-03-30 at 8.02.15 PM.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. 
> siblingsLoaded() should return siblings != null, but it currently returns 
> children != null. 
>  
>  






[jira] [Updated] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-06 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-10292:

Attachment: LUCENE-10292-1.patch
Status: Open  (was: Open)

{quote}I originally tried to replace the "R/W" locking of SearcherManager with 
an AtomicReference so we wouldn't need to have any synchronization blocks in 
{{lookup()}} at all; but I couldn't figure out a "safe" way to do that w/o ref 
counting the SearcherManager ...
{quote}
I don't know why it didn't occur to me yesterday, but the obvious solution to 
this type of situation is a {{ReadWriteLock}} ... patch updated to use a 
{{writeLock()}} when replacing the {{SearcherManager}} and a {{readLock()}} in 
{{lookup()}} (and {{getCount()}})
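
The locking scheme described above can be sketched roughly as follows. This is a minimal illustration rather than the attached patch: the class name `GuardedLookup`, the `String` standing in for the `SearcherManager`, and the choice to return an empty result before the first build are all assumptions made for the example.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a suggester that guards a replaceable resource. */
final class GuardedLookup {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private String resource; // stands in for the SearcherManager; null until built

  /** Writer side: swap the resource while holding the exclusive write lock. */
  void replace(String newResource) {
    lock.writeLock().lock();
    try {
      resource = newResource;
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Reader side: any number of lookups may hold the read lock at once. */
  String lookup(String key) {
    lock.readLock().lock();
    try {
      if (resource == null) {
        return ""; // assumption: behave like an "empty" result when nothing is built
      }
      return resource + ":" + key;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Readers only block while a writer is swapping the resource, so concurrent lookups do not serialize against each other.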

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10292-1.patch, LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> experiencing it first hand (and I don't have a test case to demonstrate it) 
> but based on a reading of the code the underlying problem seems self 
> evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is even possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}} 
> Typically this works because the {{build()}} method uses temporary 
> data structures until its "build logic" is complete, at which point it 
> atomically replaces the data structures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method 
> starts by closing & null'ing out the {{protected SearcherManager 
> searcherMgr}} (which it only populates again once it's completed building up 
> its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}
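
For contrast, the "build into temporary structures, then atomically replace" pattern that the description attributes to the other Lookup implementations might look like the sketch below. The class `SwapOnBuildLookup` and its prefix matching are hypothetical and only illustrate the publication pattern; they are not taken from any Lucene suggester.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical lookup that stays usable before and during a rebuild. */
final class SwapOnBuildLookup {
  // Readers always see either the old or the new list, never a partial one.
  private volatile List<String> entries = Collections.emptyList();

  /** Builds into a temporary list and publishes it with a single volatile write. */
  void build(Iterable<String> source) {
    List<String> temp = new ArrayList<>();
    for (String s : source) {
      temp.add(s);
    }
    entries = Collections.unmodifiableList(temp);
  }

  /** Works on a snapshot, so a concurrent build() cannot disturb it. */
  List<String> lookup(String prefix) {
    List<String> snapshot = entries;
    List<String> hits = new ArrayList<>();
    for (String e : snapshot) {
      if (e.startsWith(prefix)) {
        hits.add(e);
      }
    }
    return hits;
  }
}
```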






[GitHub] [lucene] gsmiller commented on pull request #718: LUCENE-10444: Support alternate aggregation functions in association facets

2022-04-06 Thread GitBox


gsmiller commented on PR #718:
URL: https://github.com/apache/lucene/pull/718#issuecomment-1090586015

   @mikemccand or @msokolov, did either of you have additional feedback? It 
didn't really look like it beyond the pre-existing bug (which I've since 
addressed), but I wanted to check before merging to make sure. Thanks again for 
the reviews!





[jira] [Created] (LUCENE-10504) KnnGraphTester should use KnnVectorQuery

2022-04-06 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10504:


 Summary: KnnGraphTester should use KnnVectorQuery
 Key: LUCENE-10504
 URL: https://issues.apache.org/jira/browse/LUCENE-10504
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael Sokolov


To get a more realistic picture, and to track developments in the query 
implementation, the tester should use {{KnnVectorQuery}} rather than implementing 
its own per-segment search and merging logic.






[GitHub] [lucene] msokolov opened a new pull request, #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery

2022-04-06 Thread GitBox


msokolov opened a new pull request, #796:
URL: https://github.com/apache/lucene/pull/796

   This really has two changes:
   
   1. It switches the vector searches it runs to use the Query impl, as the 
description says.
   2. It becomes a bit more clever about managing its cache of "exact" NN that 
are used for recall comparisons. Previously, if you changed the source data 
files it would still potentially re-use the cached NN file. Now it stores a 
hash of the file name and looks at the modification times to see if it should 
regenerate the NN file.
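
   A rough sketch of the cache check described in point 2, under stated assumptions: the class `NnCacheSketch`, the `nn-<hash>.bin` naming scheme, and hashing both the docs and queries file names are invented for illustration and are not the code in this PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for deciding whether a cached exact-NN file can be reused. */
final class NnCacheSketch {
  /** Derives the cache file name from a hash of the input file names (assumed scheme). */
  static String cacheName(Path docs, Path queries) {
    String key = docs.getFileName() + "|" + queries.getFileName();
    return "nn-" + Integer.toHexString(key.hashCode()) + ".bin";
  }

  /** The cache is fresh only if it is at least as new as every input. */
  static boolean isFresh(long cacheModifiedMillis, long... inputModifiedMillis) {
    for (long input : inputModifiedMillis) {
      if (cacheModifiedMillis < input) {
        return false;
      }
    }
    return true;
  }

  /** File-based wrapper: a missing cache file always forces regeneration. */
  static boolean canReuse(Path cache, Path docs, Path queries) throws IOException {
    if (!Files.exists(cache)) {
      return false;
    }
    return isFresh(
        Files.getLastModifiedTime(cache).toMillis(),
        Files.getLastModifiedTime(docs).toMillis(),
        Files.getLastModifiedTime(queries).toMillis());
  }
}
```

   Keying the name on the inputs means switching data sets picks a different cache file, while the timestamp check catches inputs that were edited in place.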
   





[GitHub] [lucene] msokolov commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size

2022-04-06 Thread GitBox


msokolov commented on PR #786:
URL: https://github.com/apache/lucene/pull/786#issuecomment-1090789978

   I don't much like the name either. I wouldn't block, but perhaps 
`growWithoutCopying`? or `growNoCopy`? The whole idea that we are growing the 
existing array is deceptive though because really we are just creating a new 
array (maybe), but I can't improve on the name much either.
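
   To make the distinction concrete, here is a small sketch of the two behaviours being named. `GrowSketch` and its 1/8 over-allocation policy are hypothetical and are not the helpers from this PR.

```java
import java.util.Arrays;

/** Hypothetical helpers contrasting a copying grow with a non-copying one. */
final class GrowSketch {
  /** Returns an array of at least minSize, preserving the existing contents. */
  static int[] grow(int[] array, int minSize) {
    if (array.length >= minSize) {
      return array; // already large enough: no allocation, no copy
    }
    return Arrays.copyOf(array, oversize(minSize));
  }

  /** Returns an array of at least minSize; old contents are NOT carried over. */
  static int[] growNoCopy(int[] array, int minSize) {
    if (array.length >= minSize) {
      return array; // same instance, contents untouched
    }
    return new int[oversize(minSize)]; // fresh, zero-filled array
  }

  /** Over-allocation policy (an assumption for this sketch): 1/8 headroom. */
  private static int oversize(int minSize) {
    return minSize + (minSize >>> 3);
  }
}
```

   The "(maybe)" in the comment corresponds to the early return: a new array is only created when the existing one is too small.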





[GitHub] [lucene] jtibshirani merged pull request #789: Add release wizard step around build failures

2022-04-06 Thread GitBox


jtibshirani merged PR #789:
URL: https://github.com/apache/lucene/pull/789





[GitHub] [lucene] jtibshirani commented on a diff in pull request #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery

2022-04-06 Thread GitBox


jtibshirani commented on code in PR #796:
URL: https://github.com/apache/lucene/pull/796#discussion_r844431197


##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -362,18 +367,19 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][]
   long cpuTimeStartNs;
   try (Directory dir = FSDirectory.open(indexPath);
   DirectoryReader reader = DirectoryReader.open(dir)) {
+IndexSearcher searcher = new IndexSearcher(reader);
 numDocs = reader.maxDoc();
 for (int i = 0; i < warmCount; i++) {
   // warm up
   targets.get(target);
-  results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout);
+  doKnnSearch(reader, KNN_FIELD, target, topK, fanout);

Review Comment:
   Would it be fine to use `doKnnVectorQuery` here so we could delete 
`doKnnSearch`?



##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][]
 TopDocs[] results = new TopDocs[numIters];
 long elapsed, totalCpuTime, totalVisited = 0;
 try (FileChannel q = FileChannel.open(queryPath)) {
+  int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES;

Review Comment:
   Maybe we could just assert `warmCount < numIters`? It seems unusual to warm 
up with queries that you don't use in the benchmark.






[GitHub] [lucene] gf2121 opened a new pull request, #797: LUCENE-10315: Speed up DocIdsWriter by ForUtil

2022-04-06 Thread GitBox


gf2121 opened a new pull request, #797:
URL: https://github.com/apache/lucene/pull/797

   https://issues.apache.org/jira/browse/LUCENE-10315





[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-04-06 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518473#comment-17518473
 ] 

Feng Guo commented on LUCENE-10315:
---

Here is the benchmark result I got on my machine using 
[https://github.com/iverase/benchmark_forutil].
{code:java}
Benchmark                                            Mode  Cnt   Score   Error   Units
ReadInts24Benchmark.readInts24ForUtil               thrpt   25   9.086 ± 0.089  ops/us
ReadInts24Benchmark.readInts24ForUtilVisitor        thrpt   25   0.764 ± 0.005  ops/us
ReadInts24Benchmark.readInts24Legacy                thrpt   25   2.877 ± 0.013  ops/us
ReadInts24Benchmark.readInts24Visitor               thrpt   25   0.778 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong1         thrpt   25   3.329 ± 0.023  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong2         thrpt   25   3.218 ± 0.037  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong3         thrpt   25   3.755 ± 0.017  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong4         thrpt   25   3.862 ± 0.025  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor1  thrpt   25   0.710 ± 0.008  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor2  thrpt   25   0.849 ± 0.013  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor3  thrpt   25   0.804 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor4  thrpt   25   0.768 ± 0.007  ops/us
ReadIntsBenchmark.readIntsForUtil                   thrpt   25  18.957 ± 0.194  ops/us
ReadIntsBenchmark.readIntsForUtilVisitor            thrpt   25   0.817 ± 0.004  ops/us
ReadIntsBenchmark.readIntsLegacy                    thrpt   25   2.456 ± 0.016  ops/us
ReadIntsBenchmark.readIntsLegacyVisitor             thrpt   25   0.608 ± 0.007  ops/us
{code}
In this result, {{readInts24ForUtil}} runs about 3 times faster than 
{{readInts24Legacy}}. This speed is attractive to me, so I'm trying to find 
a way to solve the regression when calling the visitor. One approach I'm 
considering is to introduce {{visit(int[] docs, int count)}} for {{IntersectVisitor}}.

 

The benefits of this method:

1. It can reduce the number of virtual function calls.
2. {{BufferAdder}} can directly use {{System#arraycopy}} to append doc ids.
3. {{InverseIntersectVisitor}} can count cost faster.
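
The bulk-visit idea can be illustrated with a small sketch. The names `DocVisitor` and `BufferCollector` are hypothetical stand-ins (they are not Lucene's {{IntersectVisitor}} or {{BufferAdder}}), and the default-method fallback is an assumption about how existing visitors could stay compatible.

```java
import java.util.Arrays;

/** Hypothetical visitor interface modelled on the proposal; not Lucene's API. */
interface DocVisitor {
  void visit(int docId);

  /** Bulk variant: by default it just falls back to one call per doc. */
  default void visit(int[] docs, int count) {
    for (int i = 0; i < count; i++) {
      visit(docs[i]);
    }
  }
}

/** Buffer-backed collector that overrides the bulk method with a single copy. */
final class BufferCollector implements DocVisitor {
  private int[] buffer = new int[16];
  private int size;

  @Override
  public void visit(int docId) {
    ensureCapacity(size + 1);
    buffer[size++] = docId;
  }

  @Override
  public void visit(int[] docs, int count) {
    ensureCapacity(size + count);
    System.arraycopy(docs, 0, buffer, size, count); // one call instead of count calls
    size += count;
  }

  private void ensureCapacity(int min) {
    if (min > buffer.length) {
      buffer = Arrays.copyOf(buffer, Math.max(min, buffer.length * 2));
    }
  }

  int[] toArray() {
    return Arrays.copyOf(buffer, size);
  }
}
```

A decoder that has already materialized a block into an {{int[]}} can then hand the whole block over in one call, while visitors that do not override the bulk method keep working through the fallback.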




Using luceneutil, I reproduced the regression on my local machine with the 
nightly benchmark tasks and random seed = 10:
{code:java}
  Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff  p-value
IntNRQ         27.43  (1.8%)                    24.12  (1.1%)  -12.1% ( -14% -   -9%)    0.000
{code}
After the optimization, I can see the speed up with the same seed:
{code:java}
  Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff  p-value
IntNRQ         27.68  (1.7%)                    31.89  (2.0%)   15.2% (  11% -   19%)    0.000
{code}


I posted the draft code here: [https://github.com/apache/lucene/pull/797].
This commit 
[https://github.com/apache/lucene/pull/797/commits/7fb6ac3f5901a29d87e9fa427ba429d1e1749b14]
 shows what was changed.


[GitHub] [lucene] msokolov commented on a diff in pull request #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery

2022-04-06 Thread GitBox


msokolov commented on code in PR #796:
URL: https://github.com/apache/lucene/pull/796#discussion_r89176


##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -362,18 +367,19 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][]
   long cpuTimeStartNs;
   try (Directory dir = FSDirectory.open(indexPath);
   DirectoryReader reader = DirectoryReader.open(dir)) {
+IndexSearcher searcher = new IndexSearcher(reader);
 numDocs = reader.maxDoc();
 for (int i = 0; i < warmCount; i++) {
   // warm up
   targets.get(target);
-  results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout);
+  doKnnSearch(reader, KNN_FIELD, target, topK, fanout);

Review Comment:
   Yes, that makes sense, I don't see why not






[GitHub] [lucene] msokolov commented on a diff in pull request #796: LUCENE-10504: KnnGraphTester to use KnnVectorQuery

2022-04-06 Thread GitBox


msokolov commented on code in PR #796:
URL: https://github.com/apache/lucene/pull/796#discussion_r844451157


##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][]
 TopDocs[] results = new TopDocs[numIters];
 long elapsed, totalCpuTime, totalVisited = 0;
 try (FileChannel q = FileChannel.open(queryPath)) {
+  int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES;

Review Comment:
   Yeah, I think we can simply replace warmCount with numIters






[GitHub] [lucene] gsmiller merged pull request #718: LUCENE-10444: Support alternate aggregation functions in association facets

2022-04-06 Thread GitBox


gsmiller merged PR #718:
URL: https://github.com/apache/lucene/pull/718





[jira] [Commented] (LUCENE-10444) Support alternate aggregation functions in association facets

2022-04-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518474#comment-17518474
 ] 

ASF subversion and git services commented on LUCENE-10444:
--

Commit f870edf2fe26cffcd4bcddc760b8436c13424103 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f870edf2fe2 ]

LUCENE-10444: Support alternate aggregation functions in association facets 
(#718)



> Support alternate aggregation functions in association facets
> -
>
> Key: LUCENE-10444
> URL: https://issues.apache.org/jira/browse/LUCENE-10444
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We currently only support {{sum}} aggregations in the various association 
> facet implementations. I'd be really interested in extending the association 
> facet implementations to support other aggregations, starting with {{max}} 
> and {{min}} (in addition to {{{}sum{}}}). 
> I've been sketching up a prototype of this and I think I have a reasonable 
> way to introduce this idea. Will get a PR out for feedback soon.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[GitHub] [lucene] mikemccand commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch deci

2022-04-06 Thread GitBox


mikemccand commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r844479016


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/AlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * A modified DirectoryTaxonomyReader that always recreates a new {@link
+ * AlwaysRefreshDirectoryTaxonomyReader} instance when {@link
+ * AlwaysRefreshDirectoryTaxonomyReader#doOpenIfChanged()} is called. This enables us to easily go
+ * forward or backward in time by re-computing the ordinal space during each refresh.

Review Comment:
   Hmm in the previous revision was this a test-only class?
   
   I'm nervous about making this available in the `facet` jar -- this class 
should only be used in exceptional cases, while most (normal) cases should use 
the normal `DTR` that optimizes the very common "refresh forward" case.  Can we 
maybe make `DTR` smarter to detect when a "roll backwards / revert" situation 
is happening on refresh?






[GitHub] [lucene] zhaih commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-06 Thread GitBox


zhaih commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r844486980


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/AlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * A modified DirectoryTaxonomyReader that always recreates a new {@link
+ * AlwaysRefreshDirectoryTaxonomyReader} instance when {@link
+ * AlwaysRefreshDirectoryTaxonomyReader#doOpenIfChanged()} is called. This enables us to easily go
+ * forward or backward in time by re-computing the ordinal space during each refresh.

Review Comment:
   Yeah, I guess that's possible. Right now we have a single `EPOCH` to 
represent the version of the taxonomy index, and `EPOCH` is increased only when the 
taxonomy index is recreated. 
   
   I think we might be able to further define a `VERSION` that is 
increased at each commit and reset to 0 when `EPOCH` is bumped. Then 
we can inherit an index iff `EPOCH` is the same and `VERSION` is newer?
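
   To make the proposed rule concrete, here is a minimal sketch of the check. 
All names in it (`TaxonomyCommitStamp`, `nextCommit`, `recreate`, `canInherit`) 
are hypothetical and are not existing Lucene APIs; it only assumes that 
`VERSION` is a counter that grows by one per commit and restarts at 0 when 
`EPOCH` is bumped.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class TaxonomyCommitStamp {
  final long epoch;   // bumped only when the taxonomy index is recreated
  final long version; // bumped on every commit, reset to 0 when epoch is bumped

  TaxonomyCommitStamp(long epoch, long version) {
    this.epoch = epoch;
    this.version = version;
  }

  /** Stamp after an ordinary commit: same epoch, next version. */
  TaxonomyCommitStamp nextCommit() {
    return new TaxonomyCommitStamp(epoch, version + 1);
  }

  /** Stamp after the taxonomy index is recreated: next epoch, version back to 0. */
  TaxonomyCommitStamp recreate() {
    return new TaxonomyCommitStamp(epoch + 1, 0);
  }

  /**
   * Returns true iff a reader opened at {@code current} could inherit its
   * existing arrays when refreshing to {@code candidate}: the epoch must be
   * unchanged and the version must be strictly newer.
   */
  static boolean canInherit(TaxonomyCommitStamp current, TaxonomyCommitStamp candidate) {
    return candidate.epoch == current.epoch && candidate.version > current.version;
  }
}
```

   With this rule a forward refresh within the same epoch inherits, while a 
refresh to an older version or to a recreated index does not and would have to 
recompute the ordinal space.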






[GitHub] [lucene] gautamworah96 commented on a diff in pull request #762: LUCENE-10482 Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch d

2022-04-06 Thread GitBox


gautamworah96 commented on code in PR #762:
URL: https://github.com/apache/lucene/pull/762#discussion_r844502429


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/AlwaysRefreshDirectoryTaxonomyReader.java:
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * A modified DirectoryTaxonomyReader that always recreates a new {@link
+ * AlwaysRefreshDirectoryTaxonomyReader} instance when {@link
+ * AlwaysRefreshDirectoryTaxonomyReader#doOpenIfChanged()} is called. This enables us to easily go
+ * forward or backward in time by re-computing the ordinal space during each refresh.

Review Comment:
   ++ I separated it into a different class because in an earlier 
[comment](https://github.com/apache/lucene/pull/762#discussion_r840008690) we 
thought that it could be a good idea to expose it directly to our users. Let's 
keep it as a test class for now.
   We can explore the idea of making `DTR` smarter to detect when a "roll 
backwards / revert" situation is happening in another issue (we will also have 
to think about how to handle older indexes etc). One thing at a time.
   I'll revert it back to a test class.






[jira] [Updated] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-06 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-10292:

Attachment: LUCENE-10292-2.patch
Status: Open  (was: Open)

I refactored the test code so that it could be applied to all other {{Lookup}} 
impls (that have a {{build()}} method) and found that while none of the other 
impls had the same problem of {{.lookup()}} failing to return suggestions 
during a (re)build, a few FST-based {{Lookup}}s have {{getCount()}} impls that 
return results that are inconsistent with {{.lookup()}} because they increment a 
{{count}} variable gradually during {{build()}}.  

This latest patch (in addition to the expanded testing) fixes those {{build()}} 
methods to update their {{count}} value only after replacing the {{fst}} in use.
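
The ordering described above can be sketched as follows. This is an illustration only, not the code from the patch: {{PrefixLookupSketch}} is a hypothetical class, and a sorted {{List<String>}} stands in for the FST.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in, not a Lucene class: a sorted list plays the role of the FST.
final class PrefixLookupSketch {
  private volatile List<String> terms = Collections.emptyList(); // stand-in for the fst
  private volatile long count = 0;

  void build(Iterable<String> input) {
    List<String> newTerms = new ArrayList<>();
    long newCount = 0; // local counter: readers see no change while the build runs
    for (String term : input) {
      newTerms.add(term);
      newCount++;
    }
    Collections.sort(newTerms);
    terms = Collections.unmodifiableList(newTerms); // replace the structure lookup() uses
    count = newCount;                               // only then update the count
  }

  long getCount() {
    return count;
  }

  List<String> lookup(String prefix) {
    List<String> snapshot = terms; // read the reference once
    List<String> hits = new ArrayList<>();
    for (String term : snapshot) {
      if (term.startsWith(prefix)) {
        hits.add(term);
      }
    }
    return hits;
  }
}
```

Because the counter is accumulated in a local variable, {{getCount()}} keeps reporting the previous total for the whole duration of a rebuild instead of climbing from zero while {{lookup()}} still serves the old data.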

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> experiencing it first hand (and I don't have a test case to demonstrate it) 
> but based on a reading of the code the underlying problem seems self 
> evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}}.
> Typically this works because the {{build()}} method uses temporary 
> datastructures until its "build logic" is complete, at which point it 
> atomically replaces the datastructures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method 
> starts by closing & null'ing out the {{protected SearcherManager 
> searcherMgr}} (which it only populates again once it's completed building up 
> its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}
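
The difference between the two build styles can be reproduced in a small, deterministic sketch. Everything here is hypothetical and simplified: {{SuggesterSketch}} is not a Lucene class, a plain list stands in for the index behind {{searcherMgr}}, and the {{duringBuild}} callback stands in for a {{lookup()}} call arriving from another thread while the build is in progress.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the problem; it does not use any Lucene API.
final class SuggesterSketch {
  private volatile List<String> index; // null until the first build completes

  /** Mirrors the problematic style: clear the shared state first, then rebuild. */
  void buildClearingFirst(List<String> input, Runnable duringBuild) {
    index = null;                         // like closing & null'ing searcherMgr up front
    List<String> fresh = new ArrayList<>(input);
    duringBuild.run();                    // a lookup() arriving mid-build
    index = fresh;
  }

  /** Mirrors the usual style: build on the side, then swap in one write. */
  void buildThenSwap(List<String> input, Runnable duringBuild) {
    List<String> fresh = new ArrayList<>(input);
    duringBuild.run();                    // a lookup() arriving mid-build
    index = fresh;                        // single volatile write replaces the old data
  }

  List<String> lookup(String prefix) {
    List<String> snapshot = index;        // read the reference once
    if (snapshot == null) {
      throw new IllegalStateException("suggester was not built");
    }
    List<String> hits = new ArrayList<>();
    for (String term : snapshot) {
      if (term.startsWith(prefix)) {
        hits.add(term);
      }
    }
    return hits;
  }
}
```

In this model a lookup that arrives during {{buildClearingFirst}} hits the {{IllegalStateException}}, while the same lookup during {{buildThenSwap}} is simply answered from the previous data.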






[GitHub] [lucene] wjp719 commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size

2022-04-06 Thread GitBox


wjp719 commented on PR #786:
URL: https://github.com/apache/lucene/pull/786#issuecomment-1091011881

   @msokolov @jpountz Thank you for your comments. I renamed the method to 
`ArrayUtil#growNoCopy`.





[GitHub] [lucene] zacharymorn commented on pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery

2022-04-06 Thread GitBox


zacharymorn commented on PR #790:
URL: https://github.com/apache/lucene/pull/790#issuecomment-1091119984

   > > For the change entry, I assume this should go into version 10.0.0?
   > 
   > Yes, we need a CHANGES entry under 10.0.0 and a new entry in 
`lucene/MIGRATE.txt` that recommends replacing `DocValuesFieldExistsQuery` and 
others with `FieldExistsQuery`.
   
   Sounds good. Added.





[GitHub] [lucene] zacharymorn commented on a diff in pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery

2022-04-06 Thread GitBox


zacharymorn commented on code in PR #790:
URL: https://github.com/apache/lucene/pull/790#discussion_r844733404


##
lucene/core/src/java/org/apache/lucene/search/UsageTrackingQueryCachingPolicy.java:
##
@@ -58,12 +58,6 @@ private static boolean shouldNeverCache(Query query) {
   return true;
 }
 
-if (query instanceof DocValuesFieldExistsQuery) {
-  // We do not bother caching DocValuesFieldExistsQuery queries since they are already plenty
-  // fast.
-  return true;
-}

Review Comment:
   Oh sorry, I should have added a nocommit for this. Given that `FieldExistsQuery` 
now supports norms and vectors in addition to doc values, would skipping the 
cache for norms and vectors as well hurt performance, if we were to add a 
similar `instanceof` check for `FieldExistsQuery` here? I'm also wondering if 
there's a luceneutil-like benchmark for these?






[GitHub] [lucene] zacharymorn merged pull request #791: LUCENE-10436: (Backporting) Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery

2022-04-06 Thread GitBox


zacharymorn merged PR #791:
URL: https://github.com/apache/lucene/pull/791





[jira] [Commented] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?

2022-04-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518601#comment-17518601
 ] 

ASF subversion and git services commented on LUCENE-10436:
--

Commit a42326b9ef90a77910a7dcaf46997b53da6266b1 in lucene's branch 
refs/heads/branch_9x from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a42326b9ef9 ]

LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
KnnVectorFieldExistsQuery with FieldExistsQuery (#767) (#791)



> Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery into a single FieldExistsQuery?
> --
>
> Key: LUCENE-10436
> URL: https://issues.apache.org/jira/browse/LUCENE-10436
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that we require consistency across data structures, we could merge 
> DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require 
> that the field indexes either norms, doc values or vectors?






[GitHub] [lucene] zacharymorn opened a new pull request, #798: LUCENE-10436: (Backport) Remove usage of DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery

2022-04-06 Thread GitBox


zacharymorn opened a new pull request, #798:
URL: https://github.com/apache/lucene/pull/798

   Backporting PR https://github.com/apache/lucene/pull/790 without removal of 
the deprecated queries.





[GitHub] [lucene] zacharymorn commented on pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery

2022-04-06 Thread GitBox


zacharymorn commented on PR #790:
URL: https://github.com/apache/lucene/pull/790#issuecomment-1091130105

   > Great. We should backport these changes, except the actual removals, to 
9.x to address deprecation warnings.
   
   Thanks for the review! I've created the backporting PR for 9.x here 
https://github.com/apache/lucene/pull/798





[GitHub] [lucene] jpountz commented on a diff in pull request #790: LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery

2022-04-06 Thread GitBox


jpountz commented on code in PR #790:
URL: https://github.com/apache/lucene/pull/790#discussion_r844759861


##
lucene/core/src/java/org/apache/lucene/search/UsageTrackingQueryCachingPolicy.java:
##
@@ -58,12 +58,6 @@ private static boolean shouldNeverCache(Query query) {
   return true;
 }
 
-if (query instanceof DocValuesFieldExistsQuery) {
-  // We do not bother caching DocValuesFieldExistsQuery queries since they are already plenty
-  // fast.
-  return true;
-}

Review Comment:
   I feel good about not having a benchmark for this. The reasoning is that if 
the index has a data structure that supports running the query very 
efficiently, then we should just use it and skip caching. And we have this for 
doc values, norms and vectors. In contrast, boolean queries for instance need 
to reconcile multiple queries together, which has overhead.
   
   So +1 to exclude FieldExistsQuery from caching entirely.
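
   As a rough illustration of that outcome, the check could look like the 
sketch below. The `Sketch*` types are local stand-ins defined only so the 
example compiles on its own; they are not the real `org.apache.lucene.search` 
classes, and the actual change in `UsageTrackingQueryCachingPolicy` may differ.

```java
// Local stand-ins, NOT Lucene classes; they exist only to keep the sketch self-contained.
interface SketchQuery {}

final class SketchFieldExistsQuery implements SketchQuery {
  final String field;

  SketchFieldExistsQuery(String field) {
    this.field = field;
  }
}

final class SketchBooleanQuery implements SketchQuery {}

final class CachingPolicySketch {
  /** Returns true for queries that should bypass the query cache entirely. */
  static boolean shouldNeverCache(SketchQuery query) {
    if (query instanceof SketchFieldExistsQuery) {
      // Answered directly from doc values, norms or vectors, so it is assumed
      // to be fast enough that caching would add no benefit.
      return true;
    }
    // Everything else, e.g. boolean queries that reconcile several clauses,
    // stays eligible for caching.
    return false;
  }
}
```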


