[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator
LuXugang commented on code in PR #12405: URL: https://github.com/apache/lucene/pull/12405#discussion_r1266458966

## lucene/core/src/java/org/apache/lucene/search/comparators/DoubleComparator.java:

@@ -61,8 +63,12 @@ public LeafFieldComparator getLeafComparator(LeafReaderContext context) throws I

   /** Leaf comparator for {@link DoubleComparator} that provides skipping functionality */
   public class DoubleLeafComparator extends NumericLeafComparator {
+    private final byte[] deltaOne;
+
     public DoubleLeafComparator(LeafReaderContext context) throws IOException {
       super(context);
+      deltaOne = new byte[Double.BYTES];
+      DoublePoint.encodeDimension(1d, deltaOne, 0);

Review Comment: addressed in [ef9f917](https://github.com/apache/lucene/pull/12405/commits/ef9f9173bbba04871dd489187cbdc26505d64418)
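For readers unfamiliar with the point encoding used in the diff above: `DoublePoint.encodeDimension` writes a value in Lucene's sortable byte order, which is what lets the comparator manipulate bounds at the byte level. A minimal round-trip sketch of that API (the class name `EncodeDimensionSketch` is hypothetical, not part of the PR):

```java
import org.apache.lucene.document.DoublePoint;

public class EncodeDimensionSketch {
  public static void main(String[] args) {
    // Encode 1.0 into the sortable byte representation used for BKD points.
    byte[] deltaOne = new byte[Double.BYTES];
    DoublePoint.encodeDimension(1d, deltaOne, 0);

    // The encoding round-trips exactly, and its byte order preserves numeric order.
    System.out.println(DoublePoint.decodeDimension(deltaOne, 0)); // 1.0
  }
}
```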
[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator
LuXugang commented on code in PR #12405: URL: https://github.com/apache/lucene/pull/12405#discussion_r1266460078

## lucene/core/src/java/org/apache/lucene/search/comparators/IntComparator.java:

@@ -98,19 +99,30 @@ public void copy(int slot, int doc) throws IOException {

   @Override
   protected boolean isMissingValueCompetitive() {
     int result = Integer.compare(missingValue, bottom);
-    // in reverse (desc) sort missingValue is competitive when it's greater or equal to bottom,
-    // in asc sort missingValue is competitive when it's smaller or equal to bottom
-    return reverse ? (result >= 0) : (result <= 0);
+    return reverse
+        ? (pruning == Pruning.SKIP_MORE ? result > 0 : result >= 0)
+        : (pruning == Pruning.SKIP_MORE ? result < 0 : result <= 0);
   }

   @Override
   protected void encodeBottom(byte[] packedValue) {
-    IntPoint.encodeDimension(bottom, packedValue, 0);
+    if (pruning == Pruning.SKIP_MORE) {
+      IntPoint.encodeDimension(reverse ? bottom + 1 : bottom - 1, packedValue, 0);
+    } else {
+      IntPoint.encodeDimension(bottom, packedValue, 0);
+    }

Review Comment: addressed in [ef9f917](https://github.com/apache/lucene/pull/12405/commits/ef9f9173bbba04871dd489187cbdc26505d64418)
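The change above tightens the skipping bound when ties with `bottom` are themselves non-competitive. A standalone sketch of that exclusive-bound trick (the class name and the `skipMore` boolean are illustrative stand-ins for the PR's `Pruning.SKIP_MORE` level, not Lucene code):

```java
import org.apache.lucene.document.IntPoint;

public class ExclusiveBoundSketch {
  public static void main(String[] args) {
    int bottom = 42;
    boolean reverse = false; // asc sort
    boolean skipMore = true; // stands in for the PR's Pruning.SKIP_MORE

    // Shifting bottom by one turns the inclusive bound into an exclusive one,
    // so the BKD-based skipper can also prune docs whose value equals bottom.
    // (A production version would guard against overflow at Integer.MIN/MAX_VALUE.)
    int bound = skipMore ? (reverse ? bottom + 1 : bottom - 1) : bottom;

    byte[] packedValue = new byte[Integer.BYTES];
    IntPoint.encodeDimension(bound, packedValue, 0);
    System.out.println(IntPoint.decodeDimension(packedValue, 0)); // 41
  }
}
```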
[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator
LuXugang commented on code in PR #12405: URL: https://github.com/apache/lucene/pull/12405#discussion_r1266459560

## lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:

@@ -309,34 +323,100 @@ private void updateSkipInterval(boolean success) {
       }
     }

-    @Override
-    public DocIdSetIterator competitiveIterator() {
-      if (enableSkipping == false) return null;
-      return new DocIdSetIterator() {
-        private int docID = competitiveIterator.docID();
+    private class CompetitiveIterator extends DocIdSetIterator {
+
+      private final LeafReaderContext context;
+      private final int maxDoc;
+      private final String field;
+      private int doc = -1;
+      private DocIdSetIterator docsWithDocValue;
+      private DocIdSetIterator docsWithPoint;
+      private final byte[] minPackedValue;
+      private final byte[] maxPackedValue;
+      private final boolean skipWithDocValues;
+
+      CompetitiveIterator(
+          LeafReaderContext context,
+          String field,
+          boolean skipWithDocValues,
+          byte[] minPackedValue,
+          byte[] maxPackedValue) {
+        this.context = context;
+        this.maxDoc = context.reader().maxDoc();
+        this.field = field;
+        this.skipWithDocValues = skipWithDocValues;
+        this.minPackedValue = minPackedValue;
+        this.maxPackedValue = maxPackedValue;
+      }

-        @Override
-        public int nextDoc() throws IOException {
-          return advance(docID + 1);
-        }
+      @Override
+      public int docID() {
+        return doc;
+      }

-        @Override
-        public int docID() {
-          return docID;
-        }
+      @Override
+      public int nextDoc() throws IOException {
+        return advance(docID() + 1);
+      }

-        @Override
-        public long cost() {
-          return competitiveIterator.cost();
+      @Override
+      public int advance(int target) throws IOException {
+        if (target >= maxDoc) {
+          return doc = NO_MORE_DOCS;
+        } else if (docsWithPoint != null) {
+          assert hitsThresholdReached == true;
+          return doc = docsWithPoint.advance(target);
+        } else if (docsWithDocValue != null) {
+          assert hitsThresholdReached == true;
+          return doc = docsWithDocValue.advance(target);
+        } else {
+          return doc = target;
         }
+      }

-        @Override
-        public int advance(int target) throws IOException {
-          return docID = competitiveIterator.advance(target);
+      @Override
+      public long cost() {
+        return context.reader().maxDoc();
+      }
+
+      private void setDocsWithDocValue() throws IOException {
+        // if dense == true, all documents have docValues, no sense to skip by docValues
+        // docsWithDocValue need only init once
+        // if missing values are always competitive, we can never skip via doc values
+        if (skipWithDocValues
+            && docsWithDocValue == null
+            && isMissingValueNotCompetitive(minPackedValue, maxPackedValue)) {
+          this.docsWithDocValue = getNumericDocValues(context, field);
         }
-      };
+      }
+
+      private void updateDocsWithPoint(DocIdSetIterator iterator) {
+        this.docsWithPoint = iterator;
+      }
+
+      private long docsWithPointCost() {
+        return docsWithPoint.cost();
+      }
     }

+    @Override
+    public DocIdSetIterator competitiveIterator() {
+      return competitiveIterator;
+    }
+
+    /**
+     * if reverse == true, the missing value is non-competitive when it is less than minPackedValue;
+     * if reverse == false, the missing value is non-competitive when it is greater than maxPackedValue
+     */
+    protected abstract boolean isMissingValueNotCompetitive(
+        byte[] minPackedValue, byte[] maxPackedValue);

Review Comment: addressed in [69c7259](https://github.com/apache/lucene/pull/12405/commits/69c72591d895813a0d045ee170b26a5de3e0e47e)
[GitHub] [lucene] donnerpeter opened a new pull request, #12447: hunspell: speed up the dictionary enumeration
donnerpeter opened a new pull request, #12447: URL: https://github.com/apache/lucene/pull/12447

- cache each word's case and the lowercase form
- group the words by length to avoid even visiting entries with unneeded lengths
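The second bullet is the bigger structural change. A toy sketch of the length-bucketing idea (hypothetical names, not the actual hunspell dictionary code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LengthBuckets {
  public static void main(String[] args) {
    // Bucket dictionary words by length so enumeration can skip whole
    // buckets whose lengths cannot match the candidate being checked.
    Map<Integer, List<String>> wordsByLength = new HashMap<>();
    for (String word : List.of("cat", "cats", "dog", "horse")) {
      wordsByLength.computeIfAbsent(word.length(), k -> new ArrayList<>()).add(word);
    }

    // Only the length-4 bucket is visited for a 4-letter candidate:
    System.out.println(wordsByLength.getOrDefault(4, List.of())); // [cats]
  }
}
```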
[GitHub] [lucene] donnerpeter commented on pull request #12447: hunspell: speed up the dictionary enumeration
donnerpeter commented on PR #12447: URL: https://github.com/apache/lucene/pull/12447#issuecomment-1639903635

This improves suggestion performance for de/en/ru/uk by 15-30%.
[GitHub] [lucene] gashutos opened a new issue, #12448: [Performance] sort query improvement for sequential ordered data [e.g. timestamp field sort in log data]
gashutos opened a new issue, #12448: URL: https://github.com/apache/lucene/issues/12448

### Description

## Problem statement

Currently in `TopFieldCollector`, we have a PriorityQueue (min-heap binary implementation) to find the top `K` elements in `asc` or `desc` order. Whenever we find a newly iterated document competitive, we insert it into the PriorityQueue, and if the queue is full, we remove the `min` competitive document from the PriorityQueue and replace it with the newly competitive document. The time complexity for this scenario is `O(n log(K))` in the worst case, where `n = current number of competitive documents`.

Now what if `n` is pretty high and every document we try to insert into the PriorityQueue takes exactly O(log(K))? We will end up spending a lot of time in the PriorityQueue heapification process itself. We have skipping logic to reduce the iterator cost `n` whenever we update the PriorityQueue, but that doesn't work well in some scenarios.

### For example,

for time-series log/metric data where a sort query is performed on a `timestamp` field. This field is ever increasing (nearly sorted) and has very high cardinality as well! Triggering `desc` order queries won't be optimized much by the skipping logic, and the PriorityQueue will take `O(log(K))` for almost every insertion because incoming documents arrive in increasing order. Imagine the scenario below where we insert logical timestamps in increasing order `1, 2, 3, 4, ...`: inserting 16 hits the worst case, heapifying 16 all the way down to a leaf node, and the same is true for `17, 18, 19, ...`. (Imagine we need the top `15` hits, so the PQ size is 15.)

[figure: heap insertion example](https://github.com/apache/lucene/assets/104654647/dcdd1e36-ebc7-46f3-a847-fdec243b0c26)

In the above scenario, the skipping logic doesn't help find the top `K` in `desc` order either. But we can do better on the PriorityQueue side: since we know incoming documents arrive in `asc` order (nearly sorted), we can skip heapifying them.

## Solution proposed

The crux of the proposed solution is: `we don't need to heapify elements if the incoming order matches the desired order`. The idea is to keep incoming competitive documents in a temporary `circular array` before inserting them into the PriorityQueue and triggering heapification.

### Algorithm

1. Create a circular array of the same size as the priority queue. We should be able to add an element at the `last` position and remove an element from the `first` position in this circular array implementation.
2. If an incoming document qualifies after `TopFieldCollector.checkThreshold()`, keep inserting documents into the circular array until the order breaks, i.e. until the next document is smaller during a descending-order traversal (or vice versa for ascending order). Remove the first element from the circular array if the queue is full while inserting.
3. Once the order breaks, insert the entire circular array into the priority queue and empty the circular array.

With this, we will be able to skip millions of hits from going through the heapification process. A sketch of the staging idea follows at the end of this message.

POC: [POC code](https://github.com/gashutos/lucene/commit/1822f6a91b8fa11aebe80fe04b124f2c48ae2e6f)

_Implemented the above POC for reference with a LinkedList instead of a circular array. Some additional work is also needed to handle cases like duplicate elements where the docid is given preference (but those should be minor changes)._

### Caveat of this implementation

The proposed solution works pretty well for ever-increasing or ever-decreasing ordered fields like time-series data, but it could add overhead if the data is completely random: each element incurs an additional comparison to be added to the circular array before being inserted into the actual heap. We have two ways to implement this:

1. Enable this implementation by default for all types of BKD-based indexed fields.
2. Enable this implementation behind a `boolean` flag that can be set to `true/false` at the collector level, similar to what we have for the `PointBasedOptimization` flag.

Approach 2 sounds very safe, but it adds configuration overhead for users. Approach 1 can introduce overhead, but with random data the skipping logic based on BKD points works pretty well, insertions into the priority queue are few, and hence we don't see much overhead. I tested this with `13` million documents in random order on a numeric field (long), and with approach 1 we didn't see any latency regression since the skipping logic already skipped many hits. We don't have a benchmark for numeric sort in Lucene itself, but I indexed LongPoint with 13 million points/documents in random order and saw similar sort latency: `140 ms` baseline vs `147 ms` with the additional circular queue, on `r6g.2xlarge` EC2 instances.

## Suggestions from community

I would like to hear suggestions and thoughts from the community on the proposed change and better ways to implement it. Especially time series based `desc`
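A minimal, self-contained sketch of the staging idea described above. This is illustrative only: the class name `StagedTopK` is hypothetical, it is not the POC code, and it assumes ascending input feeding a min-heap that keeps the top `K` largest values:

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

/** Hypothetical sketch of the proposed circular-array staging, not Lucene code. */
public class StagedTopK {
  private final int k;
  private final PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap of top-k values
  private final ArrayDeque<Long> staged = new ArrayDeque<>();     // stand-in for the circular array

  public StagedTopK(int k) { this.k = k; }

  public void collect(long value) {
    if (staged.isEmpty() || value >= staged.peekLast()) {
      // Incoming order is preserved: stage the value without any heapification.
      staged.addLast(value);
      if (staged.size() > k) {
        // Safe to drop: k staged values are already >= the evicted one.
        staged.pollFirst();
      }
    } else {
      // Order broke: pay the O(log k) heap cost for the staged run, then restart it.
      flush();
      staged.addLast(value);
    }
  }

  private void flush() {
    for (long v : staged) {
      heap.offer(v);
      if (heap.size() > k) {
        heap.poll(); // keep only the k largest
      }
    }
    staged.clear();
  }

  /** Smallest still-competitive value, or Long.MIN_VALUE until k hits are collected. */
  public long bottom() {
    flush();
    return heap.size() < k ? Long.MIN_VALUE : heap.peek();
  }
}
```

With the `1, 2, 3, 4, ...` timestamp stream from the example above, `collect` never touches the heap while the run stays ascending, which is exactly the heapification cost the issue proposes to skip.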
[GitHub] [lucene] MartinDemberger commented on a diff in pull request #12437: LUCENE-8183: Added the abbility to get noSubMatches and noOverlapping Matches
MartinDemberger commented on code in PR #12437: URL: https://github.com/apache/lucene/pull/12437#discussion_r1267179992

## lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java:

@@ -47,6 +47,33 @@ public void testHyphenationWithDictionary() throws Exception {
         new int[] {1, 1, 1, 1, 1, 1, 1, 1, 0, 0});
   }

+  /**
+   * just tests that the two new configuration options are correctly processed; tests for the
+   * functionality are part of {@link TestCompoundWordTokenFilter}
+   */
+  public void testLucene8183() throws Exception {
+    Reader reader = new StringReader("basketballkurv");
+    TokenStream stream = new MockTokenizer(MockTokenizer.WHITESPACE, false);
+    ((Tokenizer) stream).setReader(reader);
+    stream =
+        tokenFilterFactory(
+            "HyphenationCompoundWord",
+            "hyphenator",
+            "da_UTF8.xml",

Review Comment: This dictionary is mentioned in the original issue. Should I change it to a German dictionary?
[GitHub] [lucene] donnerpeter merged pull request #12447: hunspell: speed up the dictionary enumeration
donnerpeter merged PR #12447: URL: https://github.com/apache/lucene/pull/12447
[GitHub] [lucene] MartinDemberger commented on a diff in pull request #12437: LUCENE-8183: Added the abbility to get noSubMatches and noOverlapping Matches
MartinDemberger commented on code in PR #12437: URL: https://github.com/apache/lucene/pull/12437#discussion_r1267286254

## lucene/CHANGES.txt:

@@ -65,6 +65,8 @@ New Features

 * LUCENE-10626 Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov)

+* LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches in HyphenationCompoundWordFilter (Martin Demberger, original from Rupert Westenthaler)

Review Comment: I have moved it. Is this the correct position?
[GitHub] [lucene] MartinDemberger commented on pull request #12437: LUCENE-8183: Added the abbility to get noSubMatches and noOverlapping Matches
MartinDemberger commented on PR #12437: URL: https://github.com/apache/lucene/pull/12437#issuecomment-1640968390

This is my first PR for Lucene. Thank you for your patience and help. If I can improve something, please let me know.
[GitHub] [lucene] MartinDemberger commented on issue #9231: HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled [LUCENE-8183]
MartinDemberger commented on issue #9231: URL: https://github.com/apache/lucene/issues/9231#issuecomment-1640974145

> Does this also fix #4096?

I'm sorry, but no. #4096 is about DictionaryCompoundWordTokenFilter, while this one is about HyphenationCompoundWordTokenFilter. Maybe the change can be adopted, but currently I don't have the time and knowledge to do this. Maybe I'll have a look at it in a few weeks.
[GitHub] [lucene] uschindler commented on a diff in pull request #12437: LUCENE-8183: Added the abbility to get noSubMatches and noOverlapping Matches
uschindler commented on code in PR #12437: URL: https://github.com/apache/lucene/pull/12437#discussion_r1267297167

## lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java:

@@ -47,6 +47,33 @@ public void testHyphenationWithDictionary() throws Exception {
         new int[] {1, 1, 1, 1, 1, 1, 1, 1, 0, 0});
   }

+  /**
+   * just tests that the two new configuration options are correctly processed; tests for the
+   * functionality are part of {@link TestCompoundWordTokenFilter}
+   */
+  public void testLucene8183() throws Exception {
+    Reader reader = new StringReader("basketballkurv");
+    TokenStream stream = new MockTokenizer(MockTokenizer.WHITESPACE, false);
+    ((Tokenizer) stream).setReader(reader);
+    stream =
+        tokenFilterFactory(
+            "HyphenationCompoundWord",
+            "hyphenator",
+            "da_UTF8.xml",

Review Comment: The problem with the dictionary is the license, so we should not add the originals. If this hacky small fix works for the test, all is fine.

## lucene/CHANGES.txt:

@@ -65,6 +65,8 @@ New Features

 * LUCENE-10626 Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov)

+* LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches in HyphenationCompoundWordFilter (Martin Demberger, original from Rupert Westenthaler)

Review Comment: You just moved it to enhancements. Please move it down after the 9.8 version because I will merge this later to that branch, so the fix version will be Lucene 9.8, not 10.0.
[GitHub] [lucene] asubbu90 opened a new issue, #12449: byte to int in TruncateTokenFilterFactory to TruncateTokenFilter
asubbu90 opened a new issue, #12449: URL: https://github.com/apache/lucene/issues/12449

### Description

TruncateTokenFilterFactory parses the PREFIX_LENGTH_KEY value as a Byte, which only goes up to 127, and stores it in the prefixLength attribute. TruncateTokenFilter, however, expects the argument as an int, which has a much bigger range than byte. Any value greater than 127 therefore throws an exception while being parsed as a Byte in TruncateTokenFilterFactory. I didn't see any documentation in TruncateTokenFilterFactory saying that this value must be less than 128.

### Version and environment details

Lucene 9.7.0. Also verified in the latest main code.
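The range limit itself is plain Java behavior and is reproducible without Lucene; a standalone snippet mirroring the byte parsing the issue describes (the class name is hypothetical):

```java
public class ByteParseLimit {
  public static void main(String[] args) {
    // Fine: 127 is Byte.MAX_VALUE.
    byte ok = Byte.parseByte("127");
    System.out.println(ok); // 127

    // Throws NumberFormatException: 128 is out of the byte range [-128, 127].
    byte tooBig = Byte.parseByte("128");
  }
}
```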