[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-18 Thread via GitHub


LuXugang commented on code in PR #12405:
URL: https://github.com/apache/lucene/pull/12405#discussion_r1266458966


##
lucene/core/src/java/org/apache/lucene/search/comparators/DoubleComparator.java:
##
@@ -61,8 +63,12 @@ public LeafFieldComparator getLeafComparator(LeafReaderContext context) throws I
   /** Leaf comparator for {@link DoubleComparator} that provides skipping functionality */
   public class DoubleLeafComparator extends NumericLeafComparator {
 
+    private final byte[] deltaOne;
+
     public DoubleLeafComparator(LeafReaderContext context) throws IOException {
       super(context);
+      deltaOne = new byte[Double.BYTES];
+      DoublePoint.encodeDimension(1d, deltaOne, 0);

Review Comment:
   addressed in 
[ef9f917](https://github.com/apache/lucene/pull/12405/commits/ef9f9173bbba04871dd489187cbdc26505d64418)
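   For context on the snippet above: `DoublePoint.encodeDimension` writes the value in a sortable byte representation, which is what lets the comparator work directly on encoded `byte[]` values. A minimal standalone demonstration (my own illustration, not code from the PR):

```java
import java.util.Arrays;
import org.apache.lucene.document.DoublePoint;

public class SortableEncodingDemo {
  public static void main(String[] args) {
    byte[] one = new byte[Double.BYTES];
    byte[] two = new byte[Double.BYTES];
    DoublePoint.encodeDimension(1d, one, 0);
    DoublePoint.encodeDimension(2d, two, 0);
    // Unsigned bytewise comparison agrees with numeric order: prints a negative number
    System.out.println(Arrays.compareUnsigned(one, two));
  }
}
```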






[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-18 Thread via GitHub


LuXugang commented on code in PR #12405:
URL: https://github.com/apache/lucene/pull/12405#discussion_r1266460078


##
lucene/core/src/java/org/apache/lucene/search/comparators/IntComparator.java:
##
@@ -98,19 +99,30 @@ public void copy(int slot, int doc) throws IOException {
     @Override
     protected boolean isMissingValueCompetitive() {
       int result = Integer.compare(missingValue, bottom);
-      // in reverse (desc) sort missingValue is competitive when it's greater or equal to bottom,
-      // in asc sort missingValue is competitive when it's smaller or equal to bottom
-      return reverse ? (result >= 0) : (result <= 0);
+      return reverse
+          ? (pruning == Pruning.SKIP_MORE ? result > 0 : result >= 0)
+          : (pruning == Pruning.SKIP_MORE ? result < 0 : result <= 0);
     }
 
     @Override
     protected void encodeBottom(byte[] packedValue) {
-      IntPoint.encodeDimension(bottom, packedValue, 0);
+      if (pruning == Pruning.SKIP_MORE) {
+        IntPoint.encodeDimension(reverse ? bottom + 1 : bottom - 1, packedValue, 0);
+      } else {
+        IntPoint.encodeDimension(bottom, packedValue, 0);
+      }

Review Comment:
   addressed in 
[ef9f917](https://github.com/apache/lucene/pull/12405/commits/ef9f9173bbba04871dd489187cbdc26505d64418)
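   For context on the hunk above, here is a standalone rewrite of the competitiveness check (assumed names, not Lucene's actual API; `skipMore` stands in for `pruning == Pruning.SKIP_MORE`). Under more aggressive pruning the comparison becomes strict, so a missing value that merely ties with the bottom entry is no longer competitive:

```java
class PruningSketch {
  // skipMore stands in for `pruning == Pruning.SKIP_MORE`
  static boolean missingValueCompetitive(int missingValue, int bottom, boolean reverse, boolean skipMore) {
    int result = Integer.compare(missingValue, bottom);
    // desc sort: competitive when greater than bottom (or equal, unless skipMore)
    // asc sort: competitive when smaller than bottom (or equal, unless skipMore)
    return reverse
        ? (skipMore ? result > 0 : result >= 0)
        : (skipMore ? result < 0 : result <= 0);
  }
}
```

   This is also why `encodeBottom` tightens the encoded range to `bottom + 1` / `bottom - 1` under SKIP_MORE: equality with `bottom` is no longer competitive.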






[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-18 Thread via GitHub


LuXugang commented on code in PR #12405:
URL: https://github.com/apache/lucene/pull/12405#discussion_r1266459560


##
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:
##
@@ -309,34 +323,100 @@ private void updateSkipInterval(boolean success) {
       }
     }
 
-    @Override
-    public DocIdSetIterator competitiveIterator() {
-      if (enableSkipping == false) return null;
-      return new DocIdSetIterator() {
-        private int docID = competitiveIterator.docID();
+    private class CompetitiveIterator extends DocIdSetIterator {
+
+      private final LeafReaderContext context;
+      private final int maxDoc;
+      private final String field;
+      private int doc = -1;
+      private DocIdSetIterator docsWithDocValue;
+      private DocIdSetIterator docsWithPoint;
+      private final byte[] minPackedValue;
+      private final byte[] maxPackedValue;
+      private final boolean skipWithDocValues;
+
+      CompetitiveIterator(
+          LeafReaderContext context,
+          String field,
+          boolean skipWithDocValues,
+          byte[] minPackedValue,
+          byte[] maxPackedValue) {
+        this.context = context;
+        this.maxDoc = context.reader().maxDoc();
+        this.field = field;
+        this.skipWithDocValues = skipWithDocValues;
+        this.minPackedValue = minPackedValue;
+        this.maxPackedValue = maxPackedValue;
+      }
 
-        @Override
-        public int nextDoc() throws IOException {
-          return advance(docID + 1);
-        }
+      @Override
+      public int docID() {
+        return doc;
+      }
 
-        @Override
-        public int docID() {
-          return docID;
-        }
+      @Override
+      public int nextDoc() throws IOException {
+        return advance(docID() + 1);
+      }
 
-        @Override
-        public long cost() {
-          return competitiveIterator.cost();
+      @Override
+      public int advance(int target) throws IOException {
+        if (target >= maxDoc) {
+          return doc = NO_MORE_DOCS;
+        } else if (docsWithPoint != null) {
+          assert hitsThresholdReached == true;
+          return doc = docsWithPoint.advance(target);
+        } else if (docsWithDocValue != null) {
+          assert hitsThresholdReached == true;
+          return doc = docsWithDocValue.advance(target);
+        } else {
+          return doc = target;
         }
+      }
 
-        @Override
-        public int advance(int target) throws IOException {
-          return docID = competitiveIterator.advance(target);
+      @Override
+      public long cost() {
+        return context.reader().maxDoc();
+      }
+
+      private void setDocsWithDocValue() throws IOException {
+        // if dense == true, all documents have docValues, so there is no sense in skipping by docValues
+        // docsWithDocValue only needs to be initialized once
+        // if missing values are always competitive, we can never skip via doc values
+        if (skipWithDocValues
+            && docsWithDocValue == null
+            && isMissingValueNotCompetitive(minPackedValue, maxPackedValue)) {
+          this.docsWithDocValue = getNumericDocValues(context, field);
         }
-      };
+      }
+
+      private void updateDocsWithPoint(DocIdSetIterator iterator) {
+        this.docsWithPoint = iterator;
+      }
+
+      private long docsWithPointCost() {
+        return docsWithPoint.cost();
+      }
     }
 
+    @Override
+    public DocIdSetIterator competitiveIterator() {
+      return competitiveIterator;
+    }
+
+    /**
+     * if reverse == true, the missing value is non-competitive when it is less than minPackedValue;
+     * if reverse == false, the missing value is non-competitive when it is greater than maxPackedValue
+     */
+    protected abstract boolean isMissingValueNotCompetitive(
+        byte[] minPackedValue, byte[] maxPackedValue);
Review Comment:
   addressed in 
[69c7259](https://github.com/apache/lucene/pull/12405/commits/69c72591d895813a0d045ee170b26a5de3e0e47e)
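   For context on the hunk above, the interesting part is the three-stage fallback in `advance()`. A simplified standalone sketch of the same cascade (assumed helper name, not the PR's code):

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

class CompetitiveAdvanceSketch {
  // docsWithPoint: narrowest set (docs surviving BKD range pruning); may be null.
  // docsWithDocValue: docs that have a value for the field at all; may be null.
  // Before the hit-count threshold is reached, both are null and every doc is competitive.
  static int advanceCompetitive(
      int target, int maxDoc, DocIdSetIterator docsWithPoint, DocIdSetIterator docsWithDocValue)
      throws IOException {
    if (target >= maxDoc) {
      return DocIdSetIterator.NO_MORE_DOCS;
    } else if (docsWithPoint != null) {
      return docsWithPoint.advance(target);
    } else if (docsWithDocValue != null) {
      return docsWithDocValue.advance(target);
    } else {
      return target;
    }
  }
}
```

   The design narrows the candidate set as evidence accumulates: once the hit threshold is reached, docs without a value can be skipped, and once points-based estimation pays off, only docs inside the competitive range remain.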






[GitHub] [lucene] donnerpeter opened a new pull request, #12447: hunspell: speed up the dictionary enumeration

2023-07-18 Thread via GitHub


donnerpeter opened a new pull request, #12447:
URL: https://github.com/apache/lucene/pull/12447

   - cache each word's case and the lowercase form
   - group the words by length to avoid even visiting entries of unneeded lengths (see the sketch below)
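   A minimal sketch of the length-grouping idea, assuming a plain in-memory word list (my own illustration, not the PR's actual data structures):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LengthGroupingSketch {
  // Group entries by length once, so a lookup for words of length n
  // never iterates entries of any other length.
  static Map<Integer, List<String>> groupByLength(List<String> words) {
    Map<Integer, List<String>> byLength = new HashMap<>();
    for (String word : words) {
      byLength.computeIfAbsent(word.length(), len -> new ArrayList<>()).add(word);
    }
    return byLength;
  }
}
```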





[GitHub] [lucene] donnerpeter commented on pull request #12447: hunspell: speed up the dictionary enumeration

2023-07-18 Thread via GitHub


donnerpeter commented on PR #12447:
URL: https://github.com/apache/lucene/pull/12447#issuecomment-1639903635

   This improves suggestion performance for de/en/ru/uk by 15-30%





[GitHub] [lucene] gashutos opened a new issue, #12448: [Performance] sort query improvement for sequential ordered data [e.g. timestamp field sort in log data]

2023-07-18 Thread via GitHub


gashutos opened a new issue, #12448:
URL: https://github.com/apache/lucene/issues/12448

   ### Description
   
   ## Problem statement
   Currently in `TopFieldCollector`, we use a PriorityQueue (a binary min-heap implementation) to find the top `K` elements in `asc` or `desc` order. Whenever a newly iterated document is competitive, we insert it into the PriorityQueue; if the queue is full, we remove the `min` competitive document from the PriorityQueue and replace it with the newly competitive document. The worst-case time complexity of this is `O(n log(K))`, where `n = current number of competitive documents`.
   Now, what if `n` is very high and every document we insert into the PriorityQueue takes exactly O(log(K))? We end up spending a lot of time in the PriorityQueue heapification process itself.
   We have skipping logic to reduce the iterator cost `n` whenever we update the PriorityQueue, but that doesn't work well in some scenarios.
   
   ### For example
   Consider time-series log/metric data where the sort query is performed on a `timestamp` field. This field is ever increasing (nearly sorted) and has very high cardinality as well! Triggering `desc`-order queries won't be optimized much by the skipping logic, and the PriorityQueue will take `O(log(K))` for almost every heapify insertion, because incoming documents arrive in increasing order.
   
   Imagine a scenario where we insert logical timestamps in increasing order `1, 2, 3, 4, ...`: inserting 16 will, in the worst case, heapify down to a leaf node, and the same is true for `17, 18, 19, ...` (imagine we need the top `15` hits, so the PQ size is 15).
   
   (figure: https://github.com/apache/lucene/assets/104654647/dcdd1e36-ebc7-46f3-a847-fdec243b0c26)
   
   In this scenario, the skipping logic does not help find the top `K` in `desc` order either. But we can do better on the PriorityQueue side: since we know incoming documents arrive in `asc` (nearly sorted) order, we can skip heapifying them.
   
   ## Solution proposed
   The crux of the proposed solution is: `we don't need to heapify elements if the incoming order matches the desired order`. The idea is to keep incoming competitive documents in a temporary `circular array` before inserting them into the PriorityQueue and triggering heapification.
   
   ### Algorithm
   1. Create a circular array of the same size as the priority queue. We should be able to add an element at the `last` position and remove an element from the `first` position of this circular array.
   2. If an incoming document qualifies after `TopFieldCollector.checkThreshold()`, keep inserting documents into the circular array until the order breaks, i.e. the next document is smaller for a descending-order traversal (or vice versa for ascending order). Remove the first element from the circular array if the queue is full while inserting.
   3. Once the order breaks, insert the entire circular array into the priority queue and empty the circular array.
   
   With this, we are able to keep millions of hits from going through the heapification process. A minimal sketch of the buffering idea follows.
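   A self-contained illustration (my own names and simplifications, not the POC code), collecting `long` values and keeping the top `K` largest, which matches the ascending-input / `desc`-sort case described above:

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

class BufferedTopK {
  final int k;
  final PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap holding the current top K
  final ArrayDeque<Long> run = new ArrayDeque<>();        // current ascending run, size <= k

  BufferedTopK(int k) { this.k = k; }

  void collect(long value) {
    if (!run.isEmpty() && value < run.peekLast()) {
      flush(); // order broke: drain the buffered run into the heap
    }
    run.addLast(value);
    if (run.size() > k) {
      run.pollFirst(); // k larger values follow it in the run, so it cannot be in the top K
    }
  }

  void flush() { // call once more after the last collect(); heap then holds the top K
    for (Long v : run) {
      heap.offer(v);
      if (heap.size() > k) {
        heap.poll(); // evict the smallest to keep only the top K
      }
    }
    run.clear();
  }
}
```

   As long as the input keeps ascending, `collect()` performs no heap operations at all; heapification only happens when a run ends.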
   
   POC: [POC code](https://github.com/gashutos/lucene/commit/1822f6a91b8fa11aebe80fe04b124f2c48ae2e6f)
   _The POC above is implemented with a LinkedList instead of a circular array, for reference. It also needs some additional work to handle cases like duplicate elements, where the doc id is given preference (but those should be minor changes)._
   
   ### Caveat of this implementation
   The proposed solution works very well for ever-increasing or ever-decreasing ordered fields, such as time-series data. But it could add overhead when the data is completely random: each element incurs an additional comparison to be added to the circular array before being inserted into the actual heap.
   
   We have two ways to implement this:
   1. Enable this implementation by default for all types of BKD-based indexed fields.
   2. Enable this implementation behind a `boolean` flag that can be set to `true/false` at the collector level, similar to what we have for the `PointBasedOptimization` flag.
   
   Approach 2 sounds very safe, but it results in more configuration overhead for users.
   Approach 1 can introduce overhead, but with random data the skipping logic based on BKD points works pretty well, so few insertions reach the priority queue and we don't see much overhead. I tested this with `13` million documents with random values in a numeric (long) field, and with approach 1 we didn't see any latency regression, since the skipping logic had already skipped many hits.
   We don't have a benchmark for numeric sort in Lucene itself, but I indexed LongPoint in random order with 13 million points/documents. I see similar sort latency: `140 ms` baseline versus `147 ms` with the additional circular queue, on `r6g.2xlarge` EC2 instances.
   
   ## Suggestions from community
   I would like to hear suggestions and thoughts from the community on the proposed change and better ways to implement it, especially for time-series based `desc`

[GitHub] [lucene] MartinDemberger commented on a diff in pull request #12437: LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches

2023-07-18 Thread via GitHub


MartinDemberger commented on code in PR #12437:
URL: https://github.com/apache/lucene/pull/12437#discussion_r1267179992


##
lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java:
##
@@ -47,6 +47,33 @@ public void testHyphenationWithDictionary() throws Exception {
         new int[] {1, 1, 1, 1, 1, 1, 1, 1, 0, 0});
   }
 
+  /**
+   * just tests that the two new configuration options are correctly processed; tests for the
+   * functionality are part of {@link TestCompoundWordTokenFilter}
+   */
+  public void testLucene8183() throws Exception {
+    Reader reader = new StringReader("basketballkurv");
+    TokenStream stream = new MockTokenizer(MockTokenizer.WHITESPACE, false);
+    ((Tokenizer) stream).setReader(reader);
+    stream =
+        tokenFilterFactory(
+            "HyphenationCompoundWord",
+            "hyphenator",
+            "da_UTF8.xml",

Review Comment:
   This dictionary is mentioned in the original issue. Should I change it to a German dictionary?






[GitHub] [lucene] donnerpeter merged pull request #12447: hunspell: speed up the dictionary enumeration

2023-07-18 Thread via GitHub


donnerpeter merged PR #12447:
URL: https://github.com/apache/lucene/pull/12447





[GitHub] [lucene] MartinDemberger commented on a diff in pull request #12437: LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches

2023-07-18 Thread via GitHub


MartinDemberger commented on code in PR #12437:
URL: https://github.com/apache/lucene/pull/12437#discussion_r1267286254


##
lucene/CHANGES.txt:
##
@@ -65,6 +65,8 @@ New Features
 * LUCENE-10626 Hunspell: add tools to aid dictionary editing:
   analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov)
 
+* LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches in HyphenationCompoundWordFilter (Martin Demberger, original from Rupert Westenthaler)

Review Comment:
   I have moved it. Is this the correct position?






[GitHub] [lucene] MartinDemberger commented on pull request #12437: LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches

2023-07-18 Thread via GitHub


MartinDemberger commented on PR #12437:
URL: https://github.com/apache/lucene/pull/12437#issuecomment-1640968390

   This is my first PR for Lucene. Thank you for your patience and help. If I can improve something, please let me know.





[GitHub] [lucene] MartinDemberger commented on issue #9231: HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled [LUCENE-8183]

2023-07-18 Thread via GitHub


MartinDemberger commented on issue #9231:
URL: https://github.com/apache/lucene/issues/9231#issuecomment-1640974145

   > Does this also fix #4096 ?
   
   I'm sorry, but no. #4096 concerns DictionaryCompoundWordTokenFilter, while this one is about HyphenationCompoundWordTokenFilter.
   Maybe the change can be adapted, but currently I don't have the time and knowledge to do this. Maybe I'll have a look at it in a few weeks.





[GitHub] [lucene] uschindler commented on a diff in pull request #12437: LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches

2023-07-18 Thread via GitHub


uschindler commented on code in PR #12437:
URL: https://github.com/apache/lucene/pull/12437#discussion_r1267297167


##
lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java:
##
@@ -47,6 +47,33 @@ public void testHyphenationWithDictionary() throws Exception {
         new int[] {1, 1, 1, 1, 1, 1, 1, 1, 0, 0});
   }
 
+  /**
+   * just tests that the two new configuration options are correctly processed; tests for the
+   * functionality are part of {@link TestCompoundWordTokenFilter}
+   */
+  public void testLucene8183() throws Exception {
+    Reader reader = new StringReader("basketballkurv");
+    TokenStream stream = new MockTokenizer(MockTokenizer.WHITESPACE, false);
+    ((Tokenizer) stream).setReader(reader);
+    stream =
+        tokenFilterFactory(
+            "HyphenationCompoundWord",
+            "hyphenator",
+            "da_UTF8.xml",

Review Comment:
   The problem with the dictionary is the license, so we should not add the originals. If this hacky small dictionary works for the test, all is fine.



##
lucene/CHANGES.txt:
##
@@ -65,6 +65,8 @@ New Features
 * LUCENE-10626 Hunspell: add tools to aid dictionary editing:
   analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov)
 
+* LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches in HyphenationCompoundWordFilter (Martin Demberger, original from Rupert Westenthaler)

Review Comment:
   You just moved it to enhancements. Please move it down under the 9.8 version, because I will merge this to that branch later, so the fix version will be Lucene 9.8, not 10.0.






[GitHub] [lucene] uschindler commented on a diff in pull request #12437: LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches

2023-07-18 Thread via GitHub


uschindler commented on code in PR #12437:
URL: https://github.com/apache/lucene/pull/12437#discussion_r1267299166


##
lucene/CHANGES.txt:
##
@@ -65,6 +65,8 @@ New Features
 * LUCENE-10626 Hunspell: add tools to aid dictionary editing:
   analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov)
 
+* LUCENE-8183: Added the ability to get noSubMatches and noOverlappingMatches in HyphenationCompoundWordFilter (Martin Demberger, original from Rupert Westenthaler)

Review Comment:
   You just moved it to enhancements. Please move it down under the 9.8 version, because I will merge this to that branch later, so the fix version will be Lucene 9.8, not 10.0.






[GitHub] [lucene] asubbu90 opened a new issue, #12449: byte to int in TruncateTokenFilterFactory to TruncateTokenFilter

2023-07-18 Thread via GitHub


asubbu90 opened a new issue, #12449:
URL: https://github.com/apache/lucene/issues/12449

   ### Description
   
   The TruncateTokenFilterFactory class parses the PREFIX_LENGTH_KEY value as a Byte, which only goes up to 127, and stores it in the prefixLength attribute. The TruncateTokenFilter class expects the argument as an int, which has a bigger range than byte. Any value greater than 127 throws an exception while being parsed as a Byte in the TruncateTokenFilterFactory class.
   
   I didn't see any documentation in the TruncateTokenFilterFactory class saying that this value should be less than 128.
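   A minimal standalone illustration of the mismatch (class and variable names are mine):

```java
public class TruncateLengthParseDemo {
  public static void main(String[] args) {
    String prefixLength = "128"; // any value > 127
    System.out.println(Integer.parseInt(prefixLength)); // the filter itself takes an int: fine
    try {
      Byte.parseByte(prefixLength); // byte-range parse, as described above
    } catch (NumberFormatException e) {
      System.out.println("byte-range parse fails: " + e.getMessage());
    }
  }
}
```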
   
   ### Version and environment details
   
   Lucene 9.7.0. Also verified on the latest main code.

