Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-27 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1829144978 Tested Test2BFST with `-Dtests.seed=D193E7FD4B9E68C4` **mainline** ``` 110: 432584968 RAM bytes used; 432367203 FST bytes; 211082699 nodes; took 248 seconds ```

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-27 Thread via GitHub
gf2121 merged PR #12699: URL: https://github.com/apache/lucene/pull/12699 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apac

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-27 Thread via GitHub
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1829112668 Thanks for the review and great suggestions, @mikemccand! > you want to merge and backport to 9.x? Yes. I'll merge and backport this.

[PR] Report the time it took for building the FST [lucene]

2023-11-27 Thread via GitHub
dungba88 opened a new pull request, #12847: URL: https://github.com/apache/lucene/pull/12847 ### Description - Report the time it took for building the FST - Report the FST actual size, as it can differ from the RAM bytes used once the test is moved to off-heap
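The two items in the PR description above (elapsed build time, and FST size reported separately from RAM bytes used) can be illustrated with a small stand-alone sketch. This is not the code from the PR: the class `BuildTimeReport`, the method `format`, and the placeholder values in `main` are hypothetical, and the line layout simply mirrors the `Test2BFST` log line quoted elsewhere in this thread.

```java
import java.util.concurrent.TimeUnit;

public class BuildTimeReport {
  /**
   * Formats one progress line (hypothetical helper). RAM bytes and FST bytes
   * are separate arguments because the two can differ once the FST lives off-heap.
   */
  static String format(long ramBytes, long fstBytes, long nodes, long elapsedNanos) {
    long seconds = TimeUnit.NANOSECONDS.toSeconds(elapsedNanos);
    return ramBytes + " RAM bytes used; " + fstBytes + " FST bytes; "
        + nodes + " nodes; took " + seconds + " seconds";
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime(); // monotonic clock, suited to measuring durations
    Thread.sleep(10);               // placeholder standing in for the FST build
    long elapsed = System.nanoTime() - start;
    // The three counts below are placeholders, not measured values.
    System.out.println(format(0L, 0L, 0L, elapsed));
  }
}
```

`System.nanoTime()` is used instead of `System.currentTimeMillis()` because it is not affected by wall-clock adjustments during a long-running test.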

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-27 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828936176 I checked some of the usage in the analysis module. SynonymGraphFilter caches the `BytesReader` in its constructor, and I think TokenFilters are cached per field by default? But lots of other

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-27 Thread via GitHub
zacharymorn merged PR #240: URL: https://github.com/apache/lucene/pull/240

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-27 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828839806 Ah, I think since we removed the finish(), getting the reverse bytes reader is, as expected, slower. We have to copy the bytes to a read-only buffer every time. If this is a problem maybe le

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-27 Thread via GitHub
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828597325 Hmm, also oddly -- why does the number of nodes differ between `main` and 9.x? This PR should not have altered how many nodes are created as a function of FST inputs, right? Or maybe h

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-27 Thread via GitHub
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828590265 Hmm, also the `FSTCompiler.ramBytesUsed()` seems to no longer return the growing FST size: ``` 1> 310: 560 bytes; 594876500 nodes 1> 320: 560 bytes; 614066389

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-27 Thread via GitHub
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828548480 Hmm, I'm running `Test2BFSTs` on this patch and noticed it seems to take much longer during the `TEST: now verify` step where it confirms the built FST accepts all the inputs it j

Re: [PR] BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains… [lucene]

2023-11-27 Thread via GitHub
msfroh commented on PR #12750: URL: https://github.com/apache/lucene/pull/12750#issuecomment-1828469855 I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: https://github.com/apache/lucene/commit/a03e38d5d05008aaef969a200071c03a1d6cb991 The solution t

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-27 Thread via GitHub
mikemccand commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r140662 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -104,13 +104,9 @@ public SegmentTermsEnumFrame(SegmentTermsEnu

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-27 Thread via GitHub
msokolov commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1828402628 > I don't think we give up any functionality. Can you elaborate on what functionality you are referring to? I don't think we should have a list of parent fields that IW requires, what woul

Re: [PR] upgrade to OpenNLP 2.3.0 [lucene]

2023-11-27 Thread via GitHub
epugh commented on PR #12674: URL: https://github.com/apache/lucene/pull/12674#issuecomment-1828340906 FYI 2.3.1 was just released.

Re: [PR] Add static function in TaskExecutor to retrieve the results for a collection of Future [lucene]

2023-11-27 Thread via GitHub
javanna commented on PR #12798: URL: https://github.com/apache/lucene/pull/12798#issuecomment-1828210981 With the latest updates, I am not convinced about this change. I think it's great to use TaskExecutor to execute parallel tasks, like you did in #12799, but I am under the impression tha

Re: [I] Upgrade to OpenNLP 2.0 and add [LUCENE-10621] [lucene]

2023-11-27 Thread via GitHub
epugh commented on issue #11657: URL: https://github.com/apache/lucene/issues/11657#issuecomment-1827887052 OpenNLP 2.3.1 was recently released, and it would be nice to have Lucene pick it up.

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-27 Thread via GitHub
mikemccand commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1827825175 > using a doc-value field where only parent documents have a value for the field, and the value must be the number of child documents that the parent has This is a neat idea to
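The quoted idea (only parent documents carry a value, and that value is the parent's child count) can be illustrated without any Lucene API. The following is a minimal sketch, not the PR's implementation. It assumes each parent is indexed after its children within a block, as Lucene's block join expects, and it uses a plain `int[]` with a hypothetical `NO_VALUE` sentinel in place of a real doc-values field. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockBoundaries {
  /** Hypothetical sentinel meaning "no value for this doc", i.e. the doc is a child. */
  static final int NO_VALUE = -1;

  /**
   * childCounts[doc] holds the number of children for a parent doc and
   * NO_VALUE for a child doc. Returns {firstDoc, parentDoc} (both inclusive)
   * for each block, and validates that the blocks tile the doc ID space.
   */
  static List<int[]> blocks(int[] childCounts) {
    List<int[]> result = new ArrayList<>();
    int expectedStart = 0; // first doc ID of the block currently being scanned
    for (int doc = 0; doc < childCounts.length; doc++) {
      int count = childCounts[doc];
      if (count == NO_VALUE) {
        continue; // child document, keep scanning for its parent
      }
      int first = doc - count; // children immediately precede the parent
      if (first != expectedStart) {
        throw new IllegalStateException("child count mismatch at doc " + doc);
      }
      result.add(new int[] {first, doc});
      expectedStart = doc + 1;
    }
    if (expectedStart != childCounts.length) {
      throw new IllegalStateException("trailing children without a parent");
    }
    return result;
  }
}
```

Because the count is stored on the parent, a reader can both locate block boundaries and detect a malformed block, which a bare "is parent" marker would not allow.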

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-27 Thread via GitHub
mikemccand commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1406124683 ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocV

Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]

2023-11-27 Thread via GitHub
vsop-479 commented on code in PR #12846: URL: https://github.com/apache/lucene/pull/12846#discussion_r1405891395 ## lucene/core/src/java/org/apache/lucene/codecs/CompetitiveImpactAccumulator.java: ## @@ -93,6 +93,21 @@ public void addAll(CompetitiveImpactAccumulator acc) {