Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]
vsop-479 commented on code in PR #12846: URL: https://github.com/apache/lucene/pull/12846#discussion_r1405891395 ## lucene/core/src/java/org/apache/lucene/codecs/CompetitiveImpactAccumulator.java: ## @@ -93,6 +93,21 @@ public void addAll(CompetitiveImpactAccumulator acc) { assert assertConsistent(); } + /** Copy {@code acc} into this empty acc. */ + public void copy(CompetitiveImpactAccumulator acc) { +int[] maxFreqs = this.maxFreqs; +int[] otherMaxFreqs = acc.maxFreqs; +assert Arrays.stream(maxFreqs).sum() == 0; + +System.arraycopy(otherMaxFreqs, 0, maxFreqs, 0, maxFreqs.length); + +for (Impact entry : acc.otherFreqNormPairs) { + add(entry, otherFreqNormPairs); +} Review Comment: > Did you observe a speedup with this change? I measured it with TestTermScorer.testRandomTopDocs without assertion, the result it not stable. I will try other benchmarks. > Also can you add a test? > It looks like we can optimize this bit as well by copying entries from the treeset directly? Thanks for your suggestion, i am working on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for index sorting with document blocks [lucene]
mikemccand commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1406124683 ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer { Review Comment: Do we really need a new interface here? We have only one validator now (`ParentBlockValidator`) -- can we just make a concrete instance of that which IW calls? ## lucene/core/src/java/org/apache/lucene/search/Sort.java: ## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { +this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a + * document blocks indexed with {@link + * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This + * is required for indices that use index sorting in combination with document blocks in order + * to maintain the document order of the blocks documents. Index sorting will effectively + * compare the parent (last document) of a block in order to stable sort all it's adjacent Review Comment: s/it's/its ## lucene/core/src/java/org/apache/lucene/search/Sort.java: ## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { +this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a Review Comment: Can we make it clearer that you only use this for index-time sorting? It has no meaning at search time (and should not be set)? Or ... does it have search-time meaning? Can I use this to retrieve whole blocks in my search hits? ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer { + +default void afterDocument() {} + +default void beforeDocument() {} + +default void afterDocuments(int numDocs) {} + +default void reset() {} + } + + private static class ParentBlockValidator implements DocValidator { +private boolean lastDocHasParentField = false; Review Comment: You don't need the `= false` -- it's java's default already :) ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer { + +default void afterDocument() {} + +default void beforeDocument() {} + +default void afterDocuments(int numDocs) {} + +default void reset() {} + } + + private static class ParentBlockValidator implements DocValidator { +private boolean lastDocHasParentField = false; +private boolean childDocHasParentField = false; +private final String parentFieldName; + +@Override +public void reset() { + lastDocHasParentField = false; + childDocHasParentField = false; +} + +private ParentBlockValidator(Sort sort) { + this.parentFieldName = sort.getParentField(); +} + +@Override +public void beforeDocument() { + if (childDocHasParentField == false && lastDocHasParentField) { +childDocHasParentField = true; + } + lastDocHasParentField = false; +} + +@Override +public void accept(IndexableField field) { + if (parentFieldName != null) { +if (parentFieldName.equals(field.name()) +&& DocValuesType.NUMERIC == field.fieldType().docValuesType()) { + lastDocHasParentField = true; +} + } +} + +@Override +public void afterDocument() { + if (childDocHasParentField) { +throw new IllegalArgumentException( +"only the last document in the block must contain a numeric doc values field named: " Review Comment: Maybe reword to `"child documents must not contain the parent doc values field \"" + parentFieldName + "\""` or so? ##
Re: [PR] Add support for index sorting with document blocks [lucene]
mikemccand commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1827825175 > using a doc-value field where only parents documents have a value for the field, and the value must be the number of child documents that the parent has This is a neat idea too -- would this maybe make completing the join at search time more efficient? Today we scan for the next set bit, but I guess with this approach, we'd still scan for the next `NumericDocValues` that is set, and then see the child count, but that would be sort of redundant (we could still subtract docids to get the number of children)? Also, I'm not sure how the nested case would be supported (same as the current PR's approach too). > This makes me wonder if Lucene could be adding this doc value field for root documents automatically, similarly to what it is doing for soft deletes? (Possibly in a follow-up PR) This is a nice idea too, but I think we'd still need user input to give us a free field name to use for this purpose. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Upgrade to OpenNLP 2.0 and add [LUCENE-10621] [lucene]
epugh commented on issue #11657: URL: https://github.com/apache/lucene/issues/11657#issuecomment-1827887052 OpenNLP 2.3.1 was recently released and would be nice to have Lucene pick it up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add static function in TaskExecutor to retrieve the results for a collection of Future [lucene]
javanna commented on PR #12798: URL: https://github.com/apache/lucene/pull/12798#issuecomment-1828210981 With the latest updates, I am not convinced about this change. I think it's great to use TaskExecutor to execute parallel tasks, like you did in #12799, but I am under the impression that this PR takes it a bit too far in terms of responsibilities of TaskExecutor and functionalities it exposes. I woudl be ok with those consumers that deal with futures keeping their current logic that does not depend on TaskExecutor, unless they use TaskExecutor directly to run their futures. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] upgrade to OpenNLP 2.3.0 [lucene]
epugh commented on PR #12674: URL: https://github.com/apache/lucene/pull/12674#issuecomment-1828340906 FYI 2.3.1 was just released. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for index sorting with document blocks [lucene]
msokolov commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1828402628 > I don't think we give up any functionality. can you elaborate what functionality you are referring to? I don't think we should have a list of parent fields that IW requires, what would that list be used for? I also don't understand how the APU lets you define overlapping relations that this change would prevent you from? I think we have always been able to model a multi-level hierarchy. Suppose I index a book broken into documents representing paragraphs grouped into sections, chapters, and parts; so it's a five-level hierarchy. There can be documents at every level and we want to be able to do roll-ups to any level. If I index a book as a block, we support this today. I might be confused, but I didn't think that would work with this arrangement. However maybe it does? I guess one describes every component document as having the book as a parent, and then later can still use a different parent bitset to do the roll-ups to other levels? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
mikemccand commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r140662 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -104,13 +104,9 @@ public SegmentTermsEnumFrame(SegmentTermsEnum ste, int ord) throws IOException { suffixLengthsReader = new ByteArrayDataInput(); } - public void setFloorData(ByteArrayDataInput in, BytesRef source) { -final int numBytes = source.length - (in.getPosition() - source.offset); -if (numBytes > floorData.length) { - floorData = new byte[ArrayUtil.oversize(numBytes, 1)]; -} -System.arraycopy(source.bytes, source.offset + in.getPosition(), floorData, 0, numBytes); -floorDataReader.reset(floorData, 0, numBytes); + public void setFloorData(SegmentTermsEnum.OutputAccumulator outputAccumulator) { +outputAccumulator.setFloorData(floorDataReader); +rewindPos = floorDataReader.getPosition(); Review Comment: Woops you're right! Sorry :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains… [lucene]
msfroh commented on PR #12750: URL: https://github.com/apache/lucene/pull/12750#issuecomment-1828469855 I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: https://github.com/apache/lucene/commit/a03e38d5d05008aaef969a200071c03a1d6cb991 The solution there is to *always* set the position increment and length to 1: https://github.com/apache/lucene/blob/8ef6a0da56878177ff8d6880c92e8f7d0321d076/lucene/analysis/common/src/java/org/apache/lucene/analysis/ngram/NGramTokenizer.java#L186-L187 With that change, your test passes (but I had to change every other test): https://github.com/msfroh/lucene/commit/0d05366c65a79aabc407e0662537520ba9c56737 Given that it's not backward-compatible, I imagine it would have to be a change for 10.0? Also, whatever we do should probably also be applied to ReversePathHierarchyTokenizer too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828548480 Hmm I'm running `Test2BFSTs` on this patch and noticed it seems to take very much longer during the `TEST: now verify` step where it confirms the built FST accepts all the inputs it just compiled into it. I `jstack`'d it and found this: ``` java.lang.Thread.State: RUNNABLE at java.nio.ByteBuffer.position(java.base@17.0.9/ByteBuffer.java:1516) at java.nio.ByteBuffer.position(java.base@17.0.9/ByteBuffer.java:267) at java.nio.Buffer.(java.base@17.0.9/Buffer.java:246) at java.nio.ByteBuffer.(java.base@17.0.9/ByteBuffer.java:288) at java.nio.HeapByteBuffer.(java.base@17.0.9/HeapByteBuffer.java:95) at java.nio.HeapByteBufferR.(java.base@17.0.9/HeapByteBufferR.java:102) at java.nio.HeapByteBufferR.duplicate(java.base@17.0.9/HeapByteBufferR.java:135) at java.nio.HeapByteBufferR.asReadOnlyBuffer(java.base@17.0.9/HeapByteBufferR.java:148) at org.apache.lucene.store.ByteBuffersDataInput.(ByteBuffersDataInput.java:60) at org.apache.lucene.store.ByteBuffersDataOutput.toDataInput(ByteBuffersDataOutput.java:279) at org.apache.lucene.util.fst.ReadWriteDataOutput.getReverseBytesReader(ReadWriteDataOutput.java:52) at org.apache.lucene.util.fst.FST.getBytesReader(FST.java:1181) at org.apache.lucene.util.fst.Util.get(Util.java:50) at org.apache.lucene.util.fst.Test2BFST.test(Test2BFST.java:113) ``` It looks like getting the reversed reader has maybe become quite a bit more expensive than `BytesStore` was? Note that `Test2BFSTs` is quite silly -- it uses `Util.get` for every lookup, instead of pulling and reusing a `ReverseBytesReader` like most "real" usages of FST (e.g. terms dict) will do. So perhaps this performance change doesn't really matter in practice? Though I wonder if all consumers of FST are re-using their reversed readers? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828590265 Hmm, also the `FSTCompiler.ramBytesUsed()` seems to no longer return the growing FST size: ``` 1> 310: 560 bytes; 594876500 nodes 1> 320: 560 bytes; 614066389 nodes ``` vs 9.x output: ``` 1> 310: 1220879080 bytes; 594876036 nodes 1> 320: 1260262272 bytes; 614066120 nodes ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828597325 Hmm, also oddly -- why do the number of nodes differ between `main` and 9.x? This PR should not have altered how many nodes are created as a function of FST inputs right? Or maybe how the RAM accounting is done for NodeHash changes a bit, causing a slightly larger FST on `main` than 9.x? So many WTFs ("wow that's funny") suddenly :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828839806 Ah I think since we removed the finish(), getting the reverse bytes reader is expectedly slower. We have to copy the bytes to a readonly buffer every time. If this is a problem maybe let try to add the finish() again? WDYT The ram bytes used is an easy fix. I'll change it in next rev. The difference in number of nodes is interesting. Lwt me look deeper. But did the 9.x you are testing has incorporated the value-based node hash? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn merged PR #240: URL: https://github.com/apache/lucene/pull/240 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1828936176 I checked some of the usage in the analysis module. SynonymGraphFilter cache the `BytesReader` on constructor, and I think TokenFilter by default are cached per field? But lots of other places does not have this cache, such as: - Stemmer - GeneratingSuggester - ... I think the main bottleneck are the non-trivial creation of the BufferList (in `toBufferList` or `toWritableBufferList`) and the ByteBufferDataInput. In the worst case, how do you think about reviving the BytesStore (with a much simpler implementation as it does have the scratch bytes operation)? Or alternatively, we could create a reusable block-based byte array DataOutput (not sure if there's already something like that). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Report the time it took for building the FST [lucene]
dungba88 opened a new pull request, #12847: URL: https://github.com/apache/lucene/pull/12847 ### Description - Report the time it took for building the FST - Report the FST actual size, as it can differ from the RAM bytes used once the test is moved to off-heap -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1829112668 Thanks for review and great suggestions @mikemccand ! > you want to merge and backport to 9.x? Yes. I'll merge and backport this this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
gf2121 merged PR #12699: URL: https://github.com/apache/lucene/pull/12699 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1829144978 Tested Test2BFST with `-Dtests.seed=D193E7FD4B9E68C4` **mainline** ``` 110: 432584968 RAM bytes used; 432367203 FST bytes; 211082699 nodes; took 248 seconds ``` **pr** ``` 110: 432427020 RAM bytes used; 432367203 FST bytes; 211082699 nodes; took 236 seconds ``` **9_x** ``` 110: 432584968 RAM bytes used; 432367203 FST bytes; 211082699 nodes; took 260 seconds ``` The FST size, number nodes are exactly the same, while the running time doesn't seem to differ much (running with my personal computer so the background process could be noisy). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org