[GitHub] [lucene] wjp719 commented on a diff in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-21 Thread GitBox


wjp719 commented on code in PR #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r976173487


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+        if (points.getDocCount() == context.reader().maxDoc()) {
+          delegate = null;
+        }
+
+        byte[] queryLowerPoint = LongPoint.pack(lowerValue).bytes;

Review Comment:
   I check bytes per dimensions 8 or 4, and pack
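
For readers following the diff, the contract of `nextDoc(values, packedValue, allowEqual)` can be illustrated on a plain sorted array. This is only a sketch under assumptions: the class `SortedLowerBound` and its method are hypothetical, values are `long`s instead of packed bytes, and the array index stands in for the doc ID (which holds only when the segment is sorted ascending on the field). It is a flat binary search, not the BKD tree descent the PR performs.

```java
/** Sketch only: models the nextDoc contract from the diff on a sorted long[]. */
final class SortedLowerBound {

  /**
   * Returns the first index whose value is >= target (allowEqual == true) or
   * > target (allowEqual == false), or -1 if no such value exists. The -1
   * convention mirrors the Javadoc in the diff.
   */
  static int firstIndex(long[] sorted, long target, boolean allowEqual) {
    int lo = 0;
    int hi = sorted.length; // exclusive upper bound
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      // Same role as the biggerThan predicate in the diff.
      boolean bigger = allowEqual ? sorted[mid] >= target : sorted[mid] > target;
      if (bigger) {
        hi = mid; // the answer is at mid or to its left
      } else {
        lo = mid + 1; // the answer is strictly to the right
      }
    }
    return lo == sorted.length ? -1 : lo;
  }
}
```

Calling it once with the lower bound (`allowEqual = true`) and once with the upper bound (`allowEqual = false`) yields a half-open index range of matches, which is presumably how the two `nextDoc` calls are meant to bound the iterator.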

[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-09-21 Thread GitBox


uschindler commented on PR #912:
URL: https://github.com/apache/lucene/pull/912#issuecomment-1253365297

   JDK 19 has been released, and I am working on toolchain support for 
compiling the MR-JAR. At the moment, the commented-out code does not work yet, 
because AdoptOpenJDK / Temurin has not published a release: 
https://github.com/adoptium/adoptium/issues/170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #11793: Prevent PointValues from returning null for ghost fields

2022-09-21 Thread GitBox


javanna commented on code in PR #11793:
URL: https://github.com/apache/lucene/pull/11793#discussion_r976257408


##
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:
##
@@ -104,28 +104,28 @@ public abstract class NumericLeafComparator implements LeafFieldComparator {
 
     public NumericLeafComparator(LeafReaderContext context) throws IOException {
       this.docValues = getNumericDocValues(context, field);
-      this.pointValues = canSkipDocuments ? context.reader().getPointValues(field) : null;
-      if (pointValues != null) {
-        FieldInfo info = context.reader().getFieldInfos().fieldInfo(field);
-        if (info == null || info.getPointDimensionCount() == 0) {
-          throw new IllegalStateException(
-              "Field "
-                  + field
-                  + " doesn't index points according to FieldInfos yet returns non-null PointValues");
-        } else if (info.getPointDimensionCount() > 1) {
-          throw new IllegalArgumentException(
-              "Field " + field + " is indexed with multiple dimensions, sorting is not supported");
-        } else if (info.getPointNumBytes() != bytesCount) {
-          throw new IllegalArgumentException(
-              "Field "
-                  + field
-                  + " is indexed with "
-                  + info.getPointNumBytes()
-                  + " bytes per dimension, but "
-                  + NumericComparator.this
-                  + " expected "
-                  + bytesCount);
-        }
+      FieldInfo info = context.reader().getFieldInfos().fieldInfo(field);
+      if (info == null || info.getPointDimensionCount() == 0) {
+        throw new IllegalStateException(

Review Comment:
   Yes, this is wrong; I pushed a fix. The idea is that we don't need to check 
consistency between FieldInfo and getPointValues here, so the first 
exception can go away. We check the FieldInfo, let it drive our decisions, 
and retrieve points based on what it says.






[GitHub] [lucene] shaie merged pull request #11775: Minor refactoring and cleanup to taxonomy index code

2022-09-21 Thread GitBox


shaie merged PR #11775:
URL: https://github.com/apache/lucene/pull/11775





[GitHub] [lucene] shaie opened a new pull request, #11798: Minor refactoring and cleanup to taxonomy index code

2022-09-21 Thread GitBox


shaie opened a new pull request, #11798:
URL: https://github.com/apache/lucene/pull/11798

   ### Description
   
   
   





[GitHub] [lucene] thongnt99 opened a new issue, #11799: Indexing method for learned sparse retrieval

2022-09-21 Thread GitBox


thongnt99 opened a new issue, #11799:
URL: https://github.com/apache/lucene/issues/11799

   ### Description
   
   Recent learned sparse retrieval methods 
([Splade](https://github.com/naver/splade), 
[uniCOIL](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md))
 are trained to generate impact scores directly (replacing the tf-idf score).  
   For each document, they generate a JSON file with terms and weights,  
e.g. `{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, 
"their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 
208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 
200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, 
"hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, 
"researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, 
"equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, 
"##lite": 120, "intellect": 202, "ob": 140}`
   Could we add a feature that indexes this type of document efficiently? 
   The current 
[workaround](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java)
 that I am aware of is to create a fake document by repeating the terms, e.g. 
`"the the the the  of of of of of "`.
   However, this is inefficient when impact scores are large, and it also 
requires quantizing the impact scores before indexing. 
   I think it would be very useful for many people if we could index the JSON 
files directly with float impact scores. 
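
To make the cost of the repetition workaround concrete, here is a minimal sketch of it. The class `RepeatedTermWorkaround`, the method name, and the `scale` parameter are hypothetical and are not taken from Anserini; the sketch only assumes that weights are quantized to integers by rounding and that each term is then repeated that many times.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Sketch only: builds the "fake document" described in the issue. */
final class RepeatedTermWorkaround {

  /**
   * Quantizes each weight to an int (weight * scale, rounded) and repeats the
   * term that many times. The output length grows linearly with the weights,
   * which is the inefficiency the issue points out.
   */
  static String toFakeDocument(Map<String, Float> weights, float scale) {
    StringJoiner doc = new StringJoiner(" ");
    for (Map.Entry<String, Float> e : weights.entrySet()) {
      int repeats = Math.round(e.getValue() * scale);
      for (int i = 0; i < repeats; i++) {
        doc.add(e.getKey());
      }
    }
    return doc.toString();
  }
}
```

With the weights from the example above, a single document would expand to several thousand tokens, and any fractional part of a float score is lost at the rounding step.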





[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-09-21 Thread GitBox


uschindler commented on PR #912:
URL: https://github.com/apache/lucene/pull/912#issuecomment-1253646592

   Current output:
   
   ```
   Starting a Gradle Daemon (subsequent builds will be faster)
   Directory 'C:\Users\Uwe Schindler\.gradle\daemon\7.3.3\(custom paths)' (system property 'org.gradle.java.installations.paths') used for java installations does not exist
   
   > Task :errorProneSkipped
   WARNING: errorprone disabled (skipped on builds not running inside CI environments, pass -Pvalidation.errorprone=true to enable)
   
   > Task :lucene:core:compileMain19Java FAILED
   
   FAILURE: Build failed with an exception.
   
   * What went wrong:
   Execution failed for task ':lucene:core:compileMain19Java'.
   > Error while evaluating property 'javaCompiler' of task ':lucene:core:compileMain19Java'
      > Failed to calculate the value of task ':lucene:core:compileMain19Java' property 'javaCompiler'.
         > Unable to download toolchain matching these requirements: {languageVersion=19, vendor=ADOPTOPENJDK, implementation=vendor-specific}
            > Unable to download toolchain. This might indicate that the combination (version, architecture, release/early access, ...) for the requested JDK is not available.
               > Could not read 'https://api.adoptopenjdk.net/v3/binary/latest/19/ga/windows/x64/jdk/hotspot/normal/adoptopenjdk' as it does not exist.
   ```





[GitHub] [lucene-solr] itygh commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


itygh commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253686367

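   This is an automatic vacation reply from QQ Mail. Hello, I am currently on 
vacation and cannot reply to your email personally. I will reply as soon as 
possible after my vacation ends.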





[GitHub] [lucene-solr] janhoy commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


janhoy commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253712939

   Most of these should be safe as they are pure bugfix version upgrades. I see 
one `.java` class touched, which is also safe.
   
   Can you try to highlight which part of this PR is the most "risky"? I 
suppose it would be the 5 new jars/parsers pulled in by Tika? Or the addition 
of new Calcite deps we have not had before?





[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


risdenk commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253717437

   > Can you try to highlight which part of this PR is the most "risky"? I 
suppose it would be the 5 new jars/parsers pulled in by Tika? Or the addition 
of new Calcite deps we have not had before?
   
   Yea it would be the dependency upgrades. They could be incompatible with 
each other or have other subtle bugs. The guava upgrade is potentially risky - 
but will see when I run through all the tests shortly. 
   
   I don't think any of these are terrible from a risk perspective.





[GitHub] [lucene] mocobeta commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-21 Thread GitBox


mocobeta commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1253724856

   In general I'm +1 for supporting learned sparse retrieval, though I think 
it would not be as trivial as it looks.
   
   As a start, perhaps we could use term payloads to tweak the weights 
instead of modifying the indexing chain... but there may be some overhead in 
score calculation.





[GitHub] [lucene] gcbaptista opened a new issue, #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-21 Thread GitBox


gcbaptista opened a new issue, #11800:
URL: https://github.com/apache/lucene/issues/11800

   ### Description
   
   Since release `9.1.0`, Lucene's SyntaxParser has been unable to parse 
`@` in a query, throwing a syntax error (`INVALID_SYNTAX_CANNOT_PARSE`).
   Version `9.0.0` is the last version I tested that can still parse this character.
   
   Two similar examples that throw `INVALID_SYNTAX_CANNOT_PARSE` in `9.1.0`, 
but not in `9.0.0`:
- `\\ an@tomy`
- `\\ @natomy`
   
   Leaving some log dump here:
   `Syntax Error, cannot parse \\ an@tomy:  
   INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse \\ an@tomy:  
       at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.generateParseException(StandardSyntaxParser.java:2093)
       at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.jj_consume_token(StandardSyntaxParser.java:1961)
       at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.TopLevelQuery(StandardSyntaxParser.java:115)
       at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.parse(StandardSyntaxParser.java:92)
       at org.apache.lucene.queryparser.flexible.core.QueryParserHelper.parse(QueryParserHelper.java:214)
       at org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.parse(StandardQueryParser.java:280)
   ...`
   
   
   
   ### Version and environment details
   
   - JDK: temurin-17.0.3
   - OS: MacOS Monterey 12.6 and alpine-java17:latest (docker image)
   - Lucene: 9.1.0





[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-21 Thread GitBox


kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1253843509

   > Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part 
of the codebase but I think I see what's happening and how you managed to fix 
it. Looks good to me. It'd be good to run this pipeline over a larger text base 
to make sure there are no surprises and regressions.
   
   Hi @dweiss  .. no worries, thanks for taking a look.
   
   I have made the changes you suggested. I agree that it would be good to test 
this with a large and more varied text base. I did incorporate these changes 
separately as a patch in my current project which is how I found that my 
initial proposal actually had a regression (it threw an NPE in case of empty 
input). It's possible there are other bugs that I haven't uncovered. Did you have 
any particular test dataset in mind?
   
   
---
   Totally separately I noticed that in the case of stacking two or more 
sentence-aware token filters, sentence token iteration and attribute cloning 
are performed separately across the different filters (this is the case in the 
current implementation as well). This work could probably be done in a 
dedicated `SentenceExtractionFilter` which could then pass a reference to a 
_single_ `sentenceTokenAttributes` to all downstream filters that rely on 
sentence-level analysis (perhaps such a reference could be stashed in the 
proposed `SentenceAttribute`). Also, following this reasoning a bit further, it 
is possible to conceive future implementations that rely on passing an 
arbitrary subset of adjacent tokens together to some analysis function, in 
which case the use of the term `Sentence` would become a misnomer. 
   
   However, addressing these issues would be more effort so I wanted to check 
if it is even worth it to explore. I am starting to suspect that this package 
is not really used a lot because otherwise one would expect these bugs to have 
been caught sooner given the configuration is in the documentation.





[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


risdenk commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253920621

   Well I'm struggling to run tests on my M1 Mac. They don't like JDK 8 on M1. 
I'll have to test this separately on another computer or VM.





[GitHub] [lucene] rmuir commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-21 Thread GitBox


rmuir commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1253945883

   You can use `TermFrequencyAttribute` in the analysis chain to set the 
frequency directly.
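
One way to feed such frequencies through an analysis chain is to encode each term as `term|freq` and let a token filter copy the number into `TermFrequencyAttribute`. The sketch below only produces that encoded text. The class `DelimitedFrequencies` and its `scale` parameter are hypothetical. Two further points are assumptions to verify against the Lucene documentation rather than statements from this thread: that the analysis module's `DelimitedTermFrequencyTokenFilter` reads this format, and that the field must be indexed with frequencies but without positions.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Sketch only: encodes term weights as "term|freq" tokens. */
final class DelimitedFrequencies {

  /**
   * Quantizes each weight to an int and emits one "term|freq" token per term.
   * Terms whose quantized frequency is below 1 are dropped, since a term
   * frequency of zero is not meaningful.
   */
  static String encode(Map<String, Float> weights, float scale) {
    StringJoiner out = new StringJoiner(" ");
    for (Map.Entry<String, Float> e : weights.entrySet()) {
      int freq = Math.round(e.getValue() * scale);
      if (freq >= 1) {
        out.add(e.getKey() + "|" + freq);
      }
    }
    return out.toString();
  }
}
```

Unlike repeating terms, the encoded text has one token per distinct term regardless of how large the weights are. Weights still have to be quantized, because term frequencies are integers.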





[GitHub] [lucene-solr] janhoy commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


janhoy commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253988574

   Should there have been an optional GitHub Action to run all tests in a PR? 
Something you could activate on demand?





[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


risdenk commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253995175

   > Should there have been an optional GitHub Action to run all tests in a PR? 
Something you could activate on demand?
   
   Eh, it's only an issue with 8.11 and I have ways to do it; I just forgot about 
JDK 8 and M1 being an issue :D I ran into this when testing the 8.11.2 release and 
@madrob helped explain what was going on with M1. 
   
   It would be cool, but it's not something I would spend a ton of time on. 





[GitHub] [lucene] rmuir commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-21 Thread GitBox


rmuir commented on issue #11800:
URL: https://github.com/apache/lucene/issues/11800#issuecomment-1254008994

   not a bug, but related to new features added to the parser. see the 
associated message in `MIGRATE.txt`:
   
   ```
   ## Minor syntactical changes in StandardQueryParser (Lucene 9.1)
   
   LUCENE-10223 adds interval functions and min-should-match support to 
StandardQueryParser. This
   means that interval function prefixes ("fn:") and the '@' character after 
parentheses will
   parse differently than before. If you need the exact previous behavior, 
clone the StandardSyntaxParser from the previous version of Lucene and create a 
custom query parser
   with that parser.
   ```





[GitHub] [lucene] dweiss commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-21 Thread GitBox


dweiss commented on issue #11800:
URL: https://github.com/apache/lucene/issues/11800#issuecomment-1254016674

   Also, please note that you can quote the at sign (`@`) in terms; this will 
behave like before. I don't think it's a bug. Sorry it caused you trouble, but 
the new functionality is worth it (try it!).
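
A minimal sketch of pre-processing user input along these lines follows. It assumes that "quoting" here means backslash-escaping the `@` character and that doing so restores the pre-9.1 treatment of the term; neither is confirmed in this thread. The class `AtSignEscaper` is a hypothetical helper, not a Lucene API.

```java
/** Sketch only: backslash-escapes '@' in a query string before parsing. */
final class AtSignEscaper {

  /**
   * Inserts a backslash before every unescaped '@'. Existing escape pairs
   * (a backslash followed by any character) are copied through unchanged, so
   * calling the method on its own output does not add further backslashes.
   */
  static String escapeAtSigns(String query) {
    StringBuilder sb = new StringBuilder(query.length() + 4);
    for (int i = 0; i < query.length(); i++) {
      char c = query.charAt(i);
      if (c == '\\' && i + 1 < query.length()) {
        // Copy the backslash and the character it escapes as a unit.
        sb.append(c).append(query.charAt(++i));
      } else if (c == '@') {
        sb.append('\\').append('@');
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

This only makes sense for queries where `@` is meant literally; a query that intentionally uses the new min-should-match syntax after parentheses should not be passed through it.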





[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-21 Thread GitBox


dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254019222

   > I am starting to suspect that this package is not really used a lot 
because otherwise one would expect these bugs to have been caught sooner given 
the configuration is in the documentation.
   
   You're very likely right. There's a lot of old, mostly unused cruft 
lying around. Please add a lucene/CHANGES.txt entry (under 9.5.0) and I think 
it's done.





[GitHub] [lucene] gautamworah96 commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-21 Thread GitBox


gautamworah96 commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r975969527


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+    super(
+        "Byte tracking wrapper for: " + output.getName(),
+        "ByteTrackingIndexOutput{" + output.getName() + "}");
+    this.output = output;
+    this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+    byteTracker.incrementAndGet();
+    output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+    byteTracker.addAndGet(length);
+    output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {
+    output.close();
+  }
+
+  @Override
+  public long getFilePointer() {
+    return output.getFilePointer();
+  }
+
+  @Override
+  public long getChecksum() throws IOException {
+    return output.getChecksum();
+  }
+
+  public String getWrappedName() {

Review Comment:
   Why do we need `getWrappedName` and `getWrappedToString`? They are already 
defined in the parent class



##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();

Review Comment:
   We are not using the returned value here. Just use `increment` maybe? Same 
for the `writeBytes` method?



##
lucene/core/src/java/org/apache/lucene/store/WriteAmplificationTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distr

[GitHub] [lucene] shaie merged pull request #11798: Minor refactoring and cleanup to taxonomy index code

2022-09-21 Thread GitBox


shaie merged PR #11798:
URL: https://github.com/apache/lucene/pull/11798


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-21 Thread GitBox


jtibshirani commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254070746

   +1 from me too, it'd be great to think through how to support this. Could 
you explain how the query side would look? Are the queries also sparse vectors 
with custom impacts?
   
   As a note, we have a `FeatureField` field type that accepts key-value pairs 
and stores the value in `TermFrequencyAttribute`. It's designed to help 
incorporate other scoring signals like popularity, page rank, etc. It may not 
be exactly what we want for this use case, but it could provide some 
inspiration.





[GitHub] [lucene] gsmiller commented on pull request #11797: DrillSideways uses advance instead of next when multiple dims miss

2022-09-21 Thread GitBox


gsmiller commented on PR #11797:
URL: https://github.com/apache/lucene/pull/11797#issuecomment-1254074997

   Cancelling this out as I've realized we can do even better. I'll post a new 
PR with a few more optimizations baked in.





[GitHub] [lucene] gsmiller closed pull request #11797: DrillSideways uses advance instead of next when multiple dims miss

2022-09-21 Thread GitBox


gsmiller closed pull request #11797: DrillSideways uses advance instead of next 
when multiple dims miss
URL: https://github.com/apache/lucene/pull/11797





[GitHub] [lucene] stevenschlansker commented on issue #8553: Add AccessController.doPrivileged around all calls of Class#getResource() and Class#getResourceAsStream() [LUCENE-7502]

2022-09-21 Thread GitBox


stevenschlansker commented on issue #8553:
URL: https://github.com/apache/lucene/issues/8553#issuecomment-1254096210

   AccessController is now deprecated for removal, as is the security manager. 
Is this issue still relevant?





[GitHub] [lucene] stevenschlansker commented on issue #6534: Classloader issues when running Lucene under a java SecurityManager [LUCENE-5471]

2022-09-21 Thread GitBox


stevenschlansker commented on issue #6534:
URL: https://github.com/apache/lucene/issues/6534#issuecomment-1254097488

   SecurityManager is now deprecated for removal, so this issue might no longer 
be relevant going forward.





[GitHub] [lucene] stevenschlansker opened a new issue, #11801: Remove usage of SecurityManager and AccessController

2022-09-21 Thread GitBox


stevenschlansker opened a new issue, #11801:
URL: https://github.com/apache/lucene/issues/11801

   ### Description
   
   Java is removing the SecurityManager and AccessController.
   
   Running Lucene build under Java 17 emits a lot of warnings:
   
   ```
   WARNING: A command line option has enabled the Security Manager
   WARNING: The Security Manager is deprecated and will be removed in a future 
release
   ```
   
   In a future release, this will break the build. Lucene should remove all 
uses of SecurityManager and AccessController to work in future Java versions.





[GitHub] [lucene] msokolov commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-21 Thread GitBox


msokolov commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254100366

   If you use `TermFrequencyAttribute` to customize the term frequencies, you 
can then create a Query in the normal way and compute BM25 using `b==0`; I 
think that gives you direct control over the similarity scores. Or you might 
want to write a custom Similarity to be a bit more efficient.
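   
   A minimal sketch of why `b==0` helps, assuming the textbook BM25 term-frequency component `tf / (tf + k1 * (1 - b + b * dl / avgdl))`. The class and method names are hypothetical, and Lucene's `BM25Similarity` may differ in constant factors and in how it encodes document length:

```java
/** Hypothetical helper, not a Lucene class: the textbook BM25 term-frequency component. */
final class Bm25TfSketch {

  /**
   * @param tf term frequency (here: the custom weight stored as a frequency)
   * @param k1 saturation parameter
   * @param b length-normalization parameter
   * @param docLen length of the document
   * @param avgDocLen average document length in the index
   */
  static double tfComponent(double tf, double k1, double b, double docLen, double avgDocLen) {
    // With b = 0 this factor is exactly 1, so the document length drops out.
    double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
    return tf / (tf + k1 * lengthNorm);
  }

  public static void main(String[] args) {
    // Same tf, very different document lengths, b = 0: both lines print the same value.
    System.out.println(tfComponent(5, 1.2, 0.0, 10, 100));
    System.out.println(tfComponent(5, 1.2, 0.0, 1000, 100));
  }
}
```

   With length normalization switched off, the only per-document input left is the stored frequency, which is the weight you control at index time.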





[GitHub] [lucene] rmuir commented on issue #11801: Remove usage of SecurityManager and AccessController

2022-09-21 Thread GitBox


rmuir commented on issue #11801:
URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254102841

   We use it to sandbox our tests, so we shouldn't remove it without 
replacement. Otherwise tests might interfere with each other which is not fun 
to debug.
   
   Additionally as a library, we need to support these APIs properly for 
applications that use the security manager (e.g. elasticsearch). We should 
support it as long as possible to give such apps time to "replace" as well.





[GitHub] [lucene] rmuir commented on issue #11801: Remove usage of SecurityManager and AccessController

2022-09-21 Thread GitBox


rmuir commented on issue #11801:
URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254114559

   for the tests i have a couple ideas:
   * use forbidden-apis more aggressively to statically prevent tests from 
doing stuff we don't want. Actually more powerful for our use-case in a lot of 
ways, e.g. we should ban `Thread.sleep()` :)
   * add `mockfs` layer to enforce tests only write to their own unique 
directory. Enforcing the filesystem access is isolated is key, but this should 
work almost as well as security manager (we don't have many dependencies using 
the old `java.io` etc that would bypass it)
   
   for the situation of being a library and needing to support apps that still 
rely on securitymanager, I don't see any immediate fix, because the only way to 
know the security code works is to run our tests with security manager 
enabled...





[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-21 Thread GitBox


thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254119695

   @jtibshirani The query side is the same as the document side: a dictionary 
of terms and weights. To make it compatible with Lucene, people just repeat 
each term according to its frequency. This is fine because queries are usually 
much shorter. 
   Yes, FeatureField is something similar, but we want a single Field 
containing a list of key-value pairs or a JSON-formatted value. 
   @msokolov @rmuir @mocobeta: I found 
[this](https://github.com/apache/lucene/blob/475fbd0bdde31c6a2ae62c59505cf9e8becd50e4/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.java),
 which could achieve what we want, but I think it is not very flexible: 
we need to turn the JSON file into a token stream formatted as:  
[..] ...  I think this step is redundant. Can 
we just load the JSON file directly? For this, I think we might have to move 
away from the TokenStream pipeline.  
   What do you think? Your thoughts are very much appreciated, as I am not very 
familiar with Lucene. 
   
   We can form a group to do this if you are interested. 
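   
   For illustration, the conversion step discussed above could look roughly like the sketch below: it turns a term-to-weight map into text in which each token carries its frequency after a delimiter, the kind of input `DelimitedTermFrequencyTokenFilter` parses. The class name, the `|` delimiter, and the scaling factor used to quantize float weights into integer frequencies are all assumptions made for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper, not part of Lucene. */
final class SparseVectorText {

  /** Quantizes each weight to an integer frequency (at least 1) and emits "term|freq" tokens. */
  static String toDelimitedTokens(Map<String, Float> weights, char delimiter, float scale) {
    StringJoiner tokens = new StringJoiner(" ");
    for (Map.Entry<String, Float> e : weights.entrySet()) {
      // Term frequencies are integers, so the float weight is scaled and rounded.
      int freq = Math.max(1, Math.round(e.getValue() * scale));
      tokens.add(e.getKey() + delimiter + freq);
    }
    return tokens.toString();
  }

  public static void main(String[] args) {
    Map<String, Float> doc = new LinkedHashMap<>();
    doc.put("lucene", 2.5f);
    doc.put("search", 0.75f);
    System.out.println(toDelimitedTokens(doc, '|', 100f));
  }
}
```

   The scaling factor trades precision against the size of the stored frequencies; a dedicated field type could skip this text round trip entirely, which is the question raised here.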





[GitHub] [lucene] stevenschlansker commented on issue #11801: Remove usage of SecurityManager and AccessController

2022-09-21 Thread GitBox


stevenschlansker commented on issue #11801:
URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254120085

   > for the situation of being a library and needing to support apps that 
still rely on securitymanager, I don't see any immediate fix, because the only 
way to know the security code works is to run our tests with security manager 
enabled...
   
   Yes, this is going to be a challenge. Some apps will want to be on Java 
Latest, which will not even have the types defined. Other apps will still run 
on Java 8, even 20 years later ;) , and supporting both will be tricky.





[GitHub] [lucene] gautamworah96 commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-21 Thread GitBox


gautamworah96 commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r976900966


##
lucene/core/src/java/org/apache/lucene/store/WriteAmplificationTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** {@link FilterDirectory} that tracks write amplification factor */
+public final class WriteAmplificationTrackingDirectoryWrapper extends FilterDirectory {
+
+  private final AtomicLong flushedBytes = new AtomicLong();
+  private final AtomicLong mergedBytes = new AtomicLong();
+
+  /**
+   * Sole constructor, typically called from sub-classes.
+   *
+   * @param in input Directory
+   */
+  public WriteAmplificationTrackingDirectoryWrapper(Directory in) {
+super(in);
+  }
+
+  @Override
+  public IndexOutput createOutput(String name, IOContext context) throws IOException {
+IndexOutput output = in.createOutput(name, context);
+IndexOutput byteTrackingIndexOutput;
+if (context.context.equals(IOContext.Context.FLUSH)) {

Review Comment:
   `context.context` is a bit confusing. Let's rename the method param?






[GitHub] [lucene] rmuir commented on issue #11801: Remove usage of SecurityManager and AccessController

2022-09-21 Thread GitBox


rmuir commented on issue #11801:
URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254139967

   I'm not worried, according to the JEP: https://openjdk.org/jeps/411
   ```
   In feature releases after Java 18, we will degrade other Security Manager 
APIs so that they remain in place but with limited or no functionality. For 
example, we may revise AccessController::doPrivileged simply to run the given 
action, or revise System::getSecurityManager always to return null. This will 
allow libraries that support the Security Manager and were compiled against 
previous Java releases to continue to work without change or even 
recompilation. We expect to remove the APIs once the compatibility risk of 
doing so declines to an acceptable level.
   ```
   
   So it seems these APIs will become "no-ops" first.





[GitHub] [lucene] kotman12 opened a new pull request, #11802: fix sentence iteration in opennlp package

2022-09-21 Thread GitBox


kotman12 opened a new pull request, #11802:
URL: https://github.com/apache/lucene/pull/11802

   Fix sentence boundary detection bug in case of repeating tokens (i.e. while 
using OpenNLP analysis chain in conjunction with a KeywordRepeatFilter) by 
keeping track of the sentence index and looking ahead one token. Move inner 
sentence iteration to a component to be shared by the sentence-aware OpenNLP 
filters.





[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-21 Thread GitBox


kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254253617

   @dweiss I updated CHANGES.txt but blew up this PR and messed up the history 
in the process. If you prefer, here is a more concise PR with the relevant 
changes patched in -> https://github.com/apache/lucene/pull/11802





[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-21 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977019915


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();
+output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+byteTracker.addAndGet(length);
+output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {
+output.close();
+  }
+
+  @Override
+  public long getFilePointer() {
+return output.getFilePointer();
+  }
+
+  @Override
+  public long getChecksum() throws IOException {
+return output.getChecksum();
+  }
+
+  public String getWrappedName() {

Review Comment:
   `super#getName()` and `super#toString()` would give us the name and String 
representation of this class. I made these in case someone wants to access the 
wrapped class's name and String representation. That said, I'm not really sure 
how useful these two methods would be. I could remove them?



##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();

Review Comment:
   Unfortunately, there is no pure `increment()` method: 
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html
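   
   For comparison, a small sketch of the two JDK counters: `AtomicLong` only offers increment methods that return a value, while `java.util.concurrent.atomic.LongAdder` has a `void` `increment()`. The class name below is hypothetical, and whether `LongAdder` suits this PR is a separate judgment call:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical comparison, not code from the PR. */
final class CounterSketch {

  static long countWithAtomicLong(int singleBytes, int bulkLength) {
    AtomicLong counter = new AtomicLong();
    for (int i = 0; i < singleBytes; i++) {
      counter.incrementAndGet(); // returns the new value, which is simply ignored here
    }
    counter.addAndGet(bulkLength);
    return counter.get();
  }

  static long countWithLongAdder(int singleBytes, int bulkLength) {
    LongAdder counter = new LongAdder();
    for (int i = 0; i < singleBytes; i++) {
      counter.increment(); // void: there is no returned value to ignore
    }
    counter.add(bulkLength);
    return counter.sum();
  }
}
```

   Both variants end up with the same total; ignoring the value returned by `incrementAndGet()` is harmless.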



##
lucene/core/src/java/org/apache/lucene/store/WriteAmplificationTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this fil

[GitHub] [lucene] gsmiller opened a new pull request, #11803: DrillSideways optimizations

2022-09-21 Thread GitBox


gsmiller opened a new pull request, #11803:
URL: https://github.com/apache/lucene/pull/11803

   ### Description
   
   This change makes use of `advance` instead of `next` where possible and 
splits out 1st and 2nd phase checking to avoid match confirmation when 
unnecessary.
   
   Note that I only focused on the `doQueryFirstScoring` implementation here 
and didn't modify the other two scoring approaches. "Progress not perfection" 
and all that (plus, I think we should strongly consider removing these other 
two implementations, but we'd want to benchmark to be certain).
   
   Unfortunately, `luceneutil` doesn't have dedicated drill sideways 
benchmarks, but some benchmarks on our internal software that makes use of 
drill sideways showed a +2% QPS improvement and no obvious regressions.
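   
   To illustrate the `advance` versus `next` point only (not the two-phase part), here is a rough sketch on plain sorted arrays. `SortedDocs` is a hypothetical stand-in, not Lucene's `DocIdSetIterator`, and the binary-search `advance` is just one possible way to skip:

```java
import java.util.Arrays;

/** Hypothetical stand-in for a doc ID iterator; not Lucene's DocIdSetIterator. */
final class SortedDocs {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // sorted, distinct doc IDs
  private int idx = -1;

  SortedDocs(int... docs) {
    this.docs = docs;
  }

  /** Moves to the next doc ID, one step at a time. */
  int nextDoc() {
    idx++;
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  /** Jumps to the first doc ID >= target (assumes target is beyond the current doc). */
  int advance(int target) {
    int from = Math.min(idx + 1, docs.length);
    int pos = Arrays.binarySearch(docs, from, docs.length, target);
    idx = pos >= 0 ? pos : -pos - 1; // a negative result encodes the insertion point
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  /** Counts docs present in both iterators, leapfrogging with advance instead of stepping. */
  static int countCommon(SortedDocs a, SortedDocs b) {
    int count = 0;
    int docA = a.nextDoc();
    int docB = b.nextDoc();
    while (docA != NO_MORE_DOCS && docB != NO_MORE_DOCS) {
      if (docA == docB) {
        count++;
        docA = a.nextDoc();
        docB = b.nextDoc();
      } else if (docA < docB) {
        docA = a.advance(docB); // skip everything that cannot match
      } else {
        docB = b.advance(docA);
      }
    }
    return count;
  }
}
```

   When one side is known to miss, jumping straight to the other side's current doc avoids visiting every doc in between.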





[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-21 Thread GitBox


wjp719 commented on PR #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1254416217

   > Thanks, this looks good to me! Can you add a CHANGES entry with your name 
under 9.5?
   
   Thanks a lot, I have added the CHANGES entry.
   
   This PR has a limitation: only an index sorted in ascending order can use 
BKD binary search to get the min/max docId. The main reason is that BKD 
currently sorts points by two dimensions (point value, docId), both in 
ascending order. If docs are index-sorted in ascending order, then the docIds 
of all leaf points are monotonically increasing, so we can use BKD binary 
search.
   
   In our local work, if docs are index-sorted in descending order, we modify 
the BKD sorting logic to (point value in ascending order, docId in descending 
order), so that the docIds of all leaf points are monotonically decreasing, and 
we can use BKD binary search again. 
   
   So may I open another PR to add an option for BKD to sort by (point value 
in ascending order, docId in descending order)? Then the BKD binary search can 
work with both ascending and descending index sorting.
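   
   The ascending case can be pictured with a plain array, as in the sketch below. It only illustrates the idea and assumes one value per doc with doc ID `i` holding `valuesByDocId[i]`; the PR itself walks the BKD tree, and all names here are hypothetical:

```java
/** Hypothetical illustration on a plain array; the PR walks a BKD tree instead. */
final class SortedRangeSketch {

  /** Returns the first index whose value is >= target, or values.length if there is none. */
  static int firstAtLeast(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /**
   * With an ascending index sort the values are non-decreasing in doc ID order, so the docs
   * matching [lower, upper] form one contiguous range. Returns {minDoc, maxDocExclusive}.
   */
  static int[] docRange(long[] valuesByDocId, long lower, long upper) {
    int minDoc = firstAtLeast(valuesByDocId, lower);
    int maxDocExclusive =
        upper == Long.MAX_VALUE ? valuesByDocId.length : firstAtLeast(valuesByDocId, upper + 1);
    return new int[] {minDoc, maxDocExclusive};
  }
}
```

   With a descending index sort the same search only works if the doc IDs are monotonic in the other direction, which is what the proposed sort order would provide.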





[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


risdenk commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1254471773

   Took a few runs but got a pass:
   
   ```
   BUILD SUCCESSFUL
   Total time: 75 minutes 41 seconds
   ```
   
   None of the failures reproduced when run independently, so I don't think 
they had anything to do with these fixes.





[GitHub] [lucene-solr] risdenk merged pull request #2670: Backport a few upgrades to branch_8_11

2022-09-21 Thread GitBox


risdenk merged PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670





[GitHub] [lucene] jpountz commented on pull request #11793: Prevent PointValues from returning null for ghost fields

2022-09-21 Thread GitBox


jpountz commented on PR #11793:
URL: https://github.com/apache/lucene/pull/11793#issuecomment-1254593758

   Test failures suggest CheckIndex needs to have its expectations adjusted.





[GitHub] [lucene] jpountz commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-21 Thread GitBox


jpountz commented on PR #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1254600722

   I was wondering about descending sorts too! Do we actually need to make this 
configurable on BKD trees? I would rather not add this option, and instead make 
the binary search logic a bit more complex/inefficient.





[GitHub] [lucene] jpountz merged pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-21 Thread GitBox


jpountz merged PR #687:
URL: https://github.com/apache/lucene/pull/687

