[GitHub] [lucene] wjp719 commented on a diff in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
wjp719 commented on code in PR #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r976173487

## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java: ##
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+        if (points.getDocCount() == context.reader().maxDoc()) {
+          delegate = null;
+        }
+
+        byte[] queryLowerPoint = LongPoint.pack(lowerValue).bytes;

Review Comment:
I check bytes per dimensions 8 or 4, and pack
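The `biggerThan` predicate and the "first document at or above a bound" search in the diff above can be illustrated without Lucene. The following is only a minimal sketch under stated assumptions: it searches a flat, sorted list of packed values instead of descending a BKD tree, and the names `FirstAtOrAbove`, `firstIndex`, and `pack` are hypothetical, not part of the PR. It relies on the standard `Arrays.compareUnsigned` (Java 9+) for unsigned byte ordering.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; this is not the PR's code and does not use Lucene. */
final class FirstAtOrAbove {

  /** True if candidate > bound, or candidate == bound when allowEqual is set (unsigned byte order). */
  static boolean biggerThan(byte[] candidate, byte[] bound, boolean allowEqual) {
    int cmp = Arrays.compareUnsigned(candidate, bound);
    return allowEqual ? cmp >= 0 : cmp > 0;
  }

  /**
   * Binary search over values sorted in unsigned byte order.
   * Returns the first index whose value passes biggerThan, or -1 if none does.
   */
  static int firstIndex(List<byte[]> sortedValues, byte[] bound, boolean allowEqual) {
    int lo = 0;
    int hi = sortedValues.size(); // exclusive upper bound
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (biggerThan(sortedValues.get(mid), bound, allowEqual)) {
        hi = mid; // mid qualifies; keep looking for an earlier match
      } else {
        lo = mid + 1; // mid is too small; discard the left half
      }
    }
    return lo < sortedValues.size() ? lo : -1;
  }

  /** Big-endian encoding, so unsigned byte order matches numeric order for non-negative ints. */
  static byte[] pack(int v) {
    return new byte[] {(byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v};
  }
}
```

The recursive `nextDoc` in the diff applies the same idea to a tree: a subtree whose maximum packed value fails the predicate is skipped entirely, which plays the role of discarding the left half here.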
[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)
uschindler commented on PR #912: URL: https://github.com/apache/lucene/pull/912#issuecomment-1253365297 JDK 19 was released; I am working on the toolchain support needed to compile the MR-JAR. At the moment, the commented-out code does not work yet, because AdoptOpenJDK / Temurin has not published a release yet: https://github.com/adoptium/adoptium/issues/170 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #11793: Prevent PointValues from returning null for ghost fields
javanna commented on code in PR #11793:
URL: https://github.com/apache/lucene/pull/11793#discussion_r976257408

## lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java: ##
@@ -104,28 +104,28 @@ public abstract class NumericLeafComparator implements LeafFieldComparator {
     public NumericLeafComparator(LeafReaderContext context) throws IOException {
       this.docValues = getNumericDocValues(context, field);
-      this.pointValues = canSkipDocuments ? context.reader().getPointValues(field) : null;
-      if (pointValues != null) {
-        FieldInfo info = context.reader().getFieldInfos().fieldInfo(field);
-        if (info == null || info.getPointDimensionCount() == 0) {
-          throw new IllegalStateException(
-              "Field "
-                  + field
-                  + " doesn't index points according to FieldInfos yet returns non-null PointValues");
-        } else if (info.getPointDimensionCount() > 1) {
-          throw new IllegalArgumentException(
-              "Field " + field + " is indexed with multiple dimensions, sorting is not supported");
-        } else if (info.getPointNumBytes() != bytesCount) {
-          throw new IllegalArgumentException(
-              "Field "
-                  + field
-                  + " is indexed with "
-                  + info.getPointNumBytes()
-                  + " bytes per dimension, but "
-                  + NumericComparator.this
-                  + " expected "
-                  + bytesCount);
-        }
+      FieldInfo info = context.reader().getFieldInfos().fieldInfo(field);
+      if (info == null || info.getPointDimensionCount() == 0) {
+        throw new IllegalStateException(

Review Comment:
yes this is wrong, I pushed a fix: the idea is that we don't need to check consistency between FieldInfo and getPointValues here, hence the first exception can go away. We check fieldinfo and we let that drive our decisions, and retrieve points based on what field info says.
[GitHub] [lucene] shaie merged pull request #11775: Minor refactoring and cleanup to taxonomy index code
shaie merged PR #11775: URL: https://github.com/apache/lucene/pull/11775
[GitHub] [lucene] shaie opened a new pull request, #11798: Minor refactoring and cleanup to taxonomy index code
shaie opened a new pull request, #11798: URL: https://github.com/apache/lucene/pull/11798 ### Description
[GitHub] [lucene] thongnt99 opened a new issue, #11799: Indexing method for learned sparse retrieval
thongnt99 opened a new issue, #11799:
URL: https://github.com/apache/lucene/issues/11799

### Description

Recent learned sparse retrieval methods ([Splade](https://github.com/naver/splade), [uniCOIL](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md)) are trained to generate impact scores directly (replacing tf-idf scores). For each document, they generate a JSON file with terms and weights, e.g.

`{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, "their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, "hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, "researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, "equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, "##lite": 120, "intellect": 202, "ob": 140}`

Can we add a feature that indexes this type of document efficiently? The current [workaround](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java) I am aware of is to create a fake document by repeating the terms, e.g. `"the the the the of of of of of "`. However, this is not very efficient when the impact scores get large, and it also requires quantizing the impact scores before indexing. I think it would be very useful for many people if we could index the JSON files directly with float impact scores.
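The workaround described in the issue (quantize each impact score, then repeat the term that many times) can be sketched as follows. This is only an illustration under stated assumptions: `PseudoDocument`, `quantize`, and the `scale` parameter are hypothetical names, and the rounding rule is one possible choice, not the one used by Anserini or any other project.

```java
import java.util.Map;

/** Illustrative sketch of the "repeat the term" workaround; all names here are hypothetical. */
final class PseudoDocument {

  /** Quantizes a float impact to a positive integer repeat count; scale is an assumed tuning knob. */
  static int quantize(float impact, float scale) {
    return Math.max(1, Math.round(impact * scale));
  }

  /** Builds a whitespace-separated pseudo-document in which each term appears quantize(impact) times. */
  static String build(Map<String, Float> impacts, float scale) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Float> e : impacts.entrySet()) {
      int repeats = quantize(e.getValue(), scale);
      for (int i = 0; i < repeats; i++) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(e.getKey());
      }
    }
    return sb.toString();
  }
}
```

The length of the generated text grows linearly with the sum of the quantized weights, which is the inefficiency the issue points out for large impact scores.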
[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)
uschindler commented on PR #912:
URL: https://github.com/apache/lucene/pull/912#issuecomment-1253646592

Current output:

```
Starting a Gradle Daemon (subsequent builds will be faster)
Directory 'C:\Users\Uwe Schindler\.gradle\daemon\7.3.3\(custom paths)' (system property 'org.gradle.java.installations.paths') used for java installations does not exist

> Task :errorProneSkipped
WARNING: errorprone disabled (skipped on builds not running inside CI environments, pass -Pvalidation.errorprone=true to enable)

> Task :lucene:core:compileMain19Java FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':lucene:core:compileMain19Java'.
> Error while evaluating property 'javaCompiler' of task ':lucene:core:compileMain19Java'
   > Failed to calculate the value of task ':lucene:core:compileMain19Java' property 'javaCompiler'.
      > Unable to download toolchain matching these requirements: {languageVersion=19, vendor=ADOPTOPENJDK, implementation=vendor-specific}
         > Unable to download toolchain. This might indicate that the combination (version, architecture, release/early access, ...) for the requested JDK is not available.
            > Could not read 'https://api.adoptopenjdk.net/v3/binary/latest/19/ga/windows/x64/jdk/hotspot/normal/adoptopenjdk' as it does not exist.
```
[GitHub] [lucene-solr] itygh commented on pull request #2670: Backport a few upgrades to branch_8_11
itygh commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253686367 This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.
[GitHub] [lucene-solr] janhoy commented on pull request #2670: Backport a few upgrades to branch_8_11
janhoy commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253712939 Most of these should be safe as they are pure bugfix version upgrades. I see one `.java` class touched, which is also safe. Can you try to highlight which part of this PR is the most "risky"? I suppose it would be the 5 new jars/parsers pulled in by Tika? Or the addition of new Calcite deps we have not had before?
[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11
risdenk commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253717437

> Can you try to highlight which part of this PR is the most "risky"? I suppose it would be the 5 new jars/parsers pulled in by Tika? Or the addition of new Calcite deps we have not had before?

Yea it would be the dependency upgrades. They could be incompatible with each other or have other subtle bugs. The guava upgrade is potentially risky - but will see when I run through all the tests shortly. I don't think any of these are terrible from a risk perspective.
[GitHub] [lucene] mocobeta commented on issue #11799: Indexing method for learned sparse retrieval
mocobeta commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1253724856 In general I'm +1 for supporting learned sparse retrieval, though I think it is not as trivial as it looks. As a start, perhaps we could use term payloads to tweak the weights instead of modifying the indexing chain, but there may be some overhead in score calculation.
[GitHub] [lucene] gcbaptista opened a new issue, #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
gcbaptista opened a new issue, #11800:
URL: https://github.com/apache/lucene/issues/11800

### Description

Since release `9.1.0`, Lucene's `StandardSyntaxParser` has been unable to parse `@` in a query and throws a syntax error (`INVALID_SYNTAX_CANNOT_PARSE`). Version `9.0.0` is the last version I tested that can still parse this character.

Two similar examples that throw `INVALID_SYNTAX_CANNOT_PARSE` in `9.1.0`, but not in `9.0.0`:
- `\\ an@tomy`
- `\\ @natomy`

Leaving some log dump here:

`Syntax Error, cannot parse \\ an@tomy: INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse \\ an@tomy:
at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.generateParseException(StandardSyntaxParser.java:2093)
at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.jj_consume_token(StandardSyntaxParser.java:1961)
at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.TopLevelQuery(StandardSyntaxParser.java:115)
at org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser.parse(StandardSyntaxParser.java:92)
at org.apache.lucene.queryparser.flexible.core.QueryParserHelper.parse(QueryParserHelper.java:214)
at org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.parse(StandardQueryParser.java:280)
...`

### Version and environment details

- JDK: temurin-17.0.3
- OS: macOS Monterey 12.6 and alpine-java17:latest (docker image)
- Lucene: 9.1.0
[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug
kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1253843509

> Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part of the codebase but I think I see what's happening and how you managed to fix it. Looks good to me. It'd be good to run this pipeline over a larger text base to make sure there are no surprises and regressions.

Hi @dweiss, no worries, thanks for taking a look. I have made the changes you suggested. I agree that it would be good to test this with a larger and more varied text base. I incorporated these changes separately as a patch in my current project, which is how I found that my initial proposal actually had a regression (it threw an NPE on empty input). It's possible there are other bugs that I haven't uncovered. Did you have any particular test dataset in mind?

---

Totally separately, I noticed that when two or more sentence-aware token filters are stacked, sentence token iteration and attribute cloning are performed separately in each filter (this is the case in the current implementation as well). This work could probably be done in a dedicated `SentenceExtractionFilter`, which could then pass a reference to a _single_ `sentenceTokenAttributes` to all downstream filters that rely on sentence-level analysis (perhaps such a reference could be stashed in the proposed `SentenceAttribute`). Following this reasoning a bit further, it is possible to conceive of future implementations that pass an arbitrary subset of adjacent tokens together to some analysis function, in which case the term `Sentence` would become a misnomer. However, addressing these issues would take more effort, so I wanted to check whether it is even worth exploring. I am starting to suspect that this package is not really used a lot because otherwise one would expect these bugs to have been caught sooner given the configuration is in the documentation.
[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11
risdenk commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253920621 Well I'm struggling to run tests on my M1 Mac. They don't like JDK 8 on M1. I'll have to test this separately on another computer or VM.
[GitHub] [lucene] rmuir commented on issue #11799: Indexing method for learned sparse retrieval
rmuir commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1253945883 You can use `TermFrequencyAttribute` in the analysis chain to set the frequency directly.
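Setting the frequency directly means each term is emitted once together with its integer weight, rather than being repeated. The sketch below only shows the text-encoding side of that idea and uses no Lucene classes: it serializes a term-to-weight map as `term|freq` tokens. To the best of my recollection, Lucene's analysis-common module ships a `DelimitedTermFrequencyTokenFilter` that reads such a suffix into `TermFrequencyAttribute`, with `|` as its default delimiter; treat both details as assumptions to check against the Javadoc. `TermFreqEncoder` and `encode` are hypothetical names.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative sketch: encodes integer term weights as "term|freq" tokens; not a Lucene API. */
final class TermFreqEncoder {

  /** Joins entries as term + delimiter + freq, separated by single spaces. */
  static String encode(Map<String, Integer> weights, char delimiter) {
    StringJoiner joined = new StringJoiner(" ");
    for (Map.Entry<String, Integer> e : weights.entrySet()) {
      String term = e.getKey();
      int freq = e.getValue();
      if (freq <= 0) {
        // A term frequency is a positive integer, so float impacts must be quantized beforehand.
        throw new IllegalArgumentException("frequency must be positive: " + term);
      }
      if (term.isEmpty()
          || term.indexOf(delimiter) >= 0
          || term.chars().anyMatch(Character::isWhitespace)) {
        // Such terms would be ambiguous once split on whitespace and on the delimiter.
        throw new IllegalArgumentException("term cannot be encoded: '" + term + "'");
      }
      joined.add(term + delimiter + freq);
    }
    return joined.toString();
  }
}
```

For the JSON example in the issue, the first entries would encode as `the|161 of|85`. This keeps the indexed text proportional to the number of distinct terms, but the weights still have to be integers.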
[GitHub] [lucene-solr] janhoy commented on pull request #2670: Backport a few upgrades to branch_8_11
janhoy commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253988574 Should there have been an optional GitHub Action to run all tests in a PR? Something you could activate on demand?
[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11
risdenk commented on PR #2670:
URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1253995175

> Should there have been an optional GitHub Action to run all tests in a PR? Something you could activate on demand?

eh it's only an issue with 8.11 and I have ways to do it, just forgot about JDK 8 and M1 being an issue :D I ran into this when testing the 8.11.2 release and @madrob helped explain what was going on w/ M1. It would be cool but it's not something I would spend a ton of time on.
[GitHub] [lucene] rmuir commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
rmuir commented on issue #11800:
URL: https://github.com/apache/lucene/issues/11800#issuecomment-1254008994

not a bug, but related to new features added to the parser. see the associated message in `MIGRATE.txt`:

```
## Minor syntactical changes in StandardQueryParser (Lucene 9.1)

LUCENE-10223 adds interval functions and min-should-match support to StandardQueryParser. This
means that interval function prefixes ("fn:") and the '@' character after parentheses will
parse differently than before. If you need the exact previous behavior, clone the
StandardSyntaxParser from the previous version of Lucene and create a custom query parser
with that parser.
```
[GitHub] [lucene] dweiss commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
dweiss commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1254016674 Also, please note that you can quote the at sign (`@`) in terms - this will behave like before. I don't think it's a bug; sorry it caused you trouble, but the new functionality is worth it (try it!).
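One way to apply this advice in client code is to escape `@` before the string reaches the parser. The sketch below is only an illustration under an assumption: it takes "quote" to mean a backslash escape, which has not been verified here against the 9.1 `StandardSyntaxParser` grammar. `AtSignEscaper` and `escapeAtSigns` are hypothetical names, not Lucene APIs.

```java
/** Illustrative sketch: backslash-escapes '@' in a query string; not a Lucene API. */
final class AtSignEscaper {

  /** Returns the query with every unescaped '@' prefixed by a backslash. */
  static String escapeAtSigns(String query) {
    StringBuilder sb = new StringBuilder(query.length() + 4);
    for (int i = 0; i < query.length(); i++) {
      char c = query.charAt(i);
      if (c == '\\' && i + 1 < query.length()) {
        // Copy an existing escape sequence unchanged so nothing is escaped twice.
        sb.append(c).append(query.charAt(++i));
      } else if (c == '@') {
        sb.append('\\').append('@');
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

Note that this escapes every `@`, so it also disables the min-should-match operator described in `MIGRATE.txt`; it only fits inputs in which `@` is always meant literally.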
[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug
dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254019222

> I am starting to suspect that this package is not really used a lot because otherwise one would expect these bugs to have been caught sooner given the configuration is in the documentation.

You're very likely right. There's a lot of old cruft that's mostly unused lying around. Please add a lucene/CHANGES.txt entry (under 9.5.0) and I think it's done.
[GitHub] [lucene] gautamworah96 commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
gautamworah96 commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r975969527

## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+    super(
+        "Byte tracking wrapper for: " + output.getName(),
+        "ByteTrackingIndexOutput{" + output.getName() + "}");
+    this.output = output;
+    this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+    byteTracker.incrementAndGet();
+    output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+    byteTracker.addAndGet(length);
+    output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {
+    output.close();
+  }
+
+  @Override
+  public long getFilePointer() {
+    return output.getFilePointer();
+  }
+
+  @Override
+  public long getChecksum() throws IOException {
+    return output.getChecksum();
+  }
+
+  public String getWrappedName() {

Review Comment:
Why do we need `getWrappedName` and `getWrappedToString`? They are already defined in the parent class

## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+    super(
+        "Byte tracking wrapper for: " + output.getName(),
+        "ByteTrackingIndexOutput{" + output.getName() + "}");
+    this.output = output;
+    this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+    byteTracker.incrementAndGet();

Review Comment:
We are not using the returned value here. Just use `increment` maybe? Same for the `writeBytes` method?

## lucene/core/src/java/org/apache/lucene/store/WriteAmplificationTrackingDirectoryWrapper.java: ##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distr
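The byte-tracking wrapper pattern from this review can be illustrated with plain JDK streams. The sketch below is not the PR's code: it wraps a `java.io.OutputStream` instead of a Lucene `IndexOutput`, and `ByteCountingOutputStream` and `demo` are hypothetical names. On the "just use `increment`" remark, note that `AtomicLong` has no `increment()` method (it offers `incrementAndGet()` and `getAndIncrement()`), whereas `java.util.concurrent.atomic.LongAdder` does provide void `increment()` and `add(long)`; the sketch uses `LongAdder` to show that variant.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative sketch of a byte-counting wrapper over JDK streams; not the PR's Lucene class. */
final class ByteCountingOutputStream extends FilterOutputStream {

  private final LongAdder bytesWritten;

  ByteCountingOutputStream(OutputStream out, LongAdder bytesWritten) {
    super(out);
    this.bytesWritten = bytesWritten;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    bytesWritten.increment(); // counted only after the delegate accepted the byte
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len); // delegate directly; FilterOutputStream's default loops byte by byte
    bytesWritten.add(len);
  }

  /** Small usage demo: writes 1 + 5 + 3 bytes through the wrapper and returns the tracked total. */
  static long demo() {
    LongAdder counter = new LongAdder();
    try (OutputStream os = new ByteCountingOutputStream(new ByteArrayOutputStream(), counter)) {
      os.write(1);
      os.write(new byte[10], 2, 5);
      os.write(new byte[3]); // routed to write(byte[], int, int), so it is counted once
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return counter.sum();
  }
}
```

Unlike the diff, this sketch increments the counter after the delegate write, so a write that throws is not counted; which ordering is preferable for tracking write amplification is a design choice for the PR authors.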
[GitHub] [lucene] shaie merged pull request #11798: Minor refactoring and cleanup to taxonomy index code
shaie merged PR #11798: URL: https://github.com/apache/lucene/pull/11798 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on issue #11799: Indexing method for learned sparse retrieval
jtibshirani commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254070746 +1 from me too, it'd be great to think through how to support this. Could you explain how the query side would look? Are the queries also sparse vectors with custom impacts? As a note, we have a `FeatureField` field type that accepts key-value pairs and stores the value in `TermFrequencyAttribute`. It's designed to help incorporate other scoring signals like popularity, page rank, etc. It may not be exactly what we want for this use case, but it could provide some inspiration.
[GitHub] [lucene] gsmiller commented on pull request #11797: DrillSideways uses advance instead of next when multiple dims miss
gsmiller commented on PR #11797: URL: https://github.com/apache/lucene/pull/11797#issuecomment-1254074997 Cancelling this out as I've realized we can do even better. I'll post a new PR with a few more optimizations baked in.
[GitHub] [lucene] gsmiller closed pull request #11797: DrillSideways uses advance instead of next when multiple dims miss
gsmiller closed pull request #11797: DrillSideways uses advance instead of next when multiple dims miss URL: https://github.com/apache/lucene/pull/11797
[GitHub] [lucene] stevenschlansker commented on issue #8553: Add AccessController.doPrivileged around all calls of Class#getResource() and Class#getResourceAsStream() [LUCENE-7502]
stevenschlansker commented on issue #8553: URL: https://github.com/apache/lucene/issues/8553#issuecomment-1254096210 AccessController is now deprecated for removal, as is the security manager. Is this issue still relevant?
[GitHub] [lucene] stevenschlansker commented on issue #6534: Classloader issues when running Lucene under a java SecurityManager [LUCENE-5471]
stevenschlansker commented on issue #6534: URL: https://github.com/apache/lucene/issues/6534#issuecomment-1254097488 SecurityManager is now deprecated for removal, so this issue might no longer be relevant going forward.
[GitHub] [lucene] stevenschlansker opened a new issue, #11801: Remove usage of SecurityManager and AccessController
stevenschlansker opened a new issue, #11801: URL: https://github.com/apache/lucene/issues/11801 ### Description Java is removing the SecurityManager and AccessController. Running Lucene build under Java 17 emits a lot of warnings: ``` WARNING: A command line option has enabled the Security Manager WARNING: The Security Manager is deprecated and will be removed in a future release ``` In a future release, this will break the build. Lucene should remove all uses of SecurityManager and AccessController to work in future Java versions.
[GitHub] [lucene] msokolov commented on issue #11799: Indexing method for learned sparse retrieval
msokolov commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254100366 If you use `TermFrequencyAttribute` to customize the term frequencies, you can then create a Query in the normal way and compute BM25 using `b==0`; I think you will then directly control the similarity scores. Or you might want to write a custom Similarity to be a bit more efficient.
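To illustrate the `b==0` suggestion: with `b = 0` the document-length normalization drops out of BM25, so a term's score depends only on its stored frequency and its IDF. The sketch below is plain Java using the textbook BM25 term weight, not Lucene's `BM25Similarity`; the class name `Bm25Sketch`, the method `termScore`, and its parameters are hypothetical names chosen for illustration.

```java
/** Textbook BM25 term weight; a hypothetical helper, not a Lucene class. */
final class Bm25Sketch {

  /**
   * Computes idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
   * With b == 0 the docLen / avgDocLen ratio no longer influences the result.
   */
  static double termScore(
      double idf, double tf, double k1, double b, double docLen, double avgDocLen) {
    // Length normalization factor; equals k1 exactly when b == 0.
    double norm = k1 * (1.0 - b + b * (docLen / avgDocLen));
    return idf * (tf * (k1 + 1.0)) / (tf + norm);
  }
}
```

Note that even with `b = 0` the score saturates in `tf` rather than growing linearly, so the stored weight controls the ranking monotonically but is not returned verbatim; that may be one reason a custom Similarity is mentioned as the alternative.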
[GitHub] [lucene] rmuir commented on issue #11801: Remove usage of SecurityManager and AccessController
rmuir commented on issue #11801: URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254102841 We use it to sandbox our tests, so we shouldn't remove it without replacement. Otherwise tests might interfere with each other which is not fun to debug. Additionally as a library, we need to support these APIs properly for applications that use the security manager (e.g. elasticsearch). We should support it as long as possible to give such apps time to "replace" as well.
[GitHub] [lucene] rmuir commented on issue #11801: Remove usage of SecurityManager and AccessController
rmuir commented on issue #11801: URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254114559 for the tests i have a couple ideas: * use forbidden-apis more aggressively to statically prevent tests from doing stuff we don't want. Actually more powerful for our use-case in a lot of ways, e.g. we should ban `Thread.sleep()` :) * add `mockfs` layer to enforce tests only write to their own unique directory. Enforcing the filesystem access is isolated is key, but this should work almost as well as security manager (we don't have many dependencies using the old `java.io` etc that would bypass it) for the situation of being a library and needing to support apps that still rely on securitymanager, I don't see any immediate fix. because the only way to know the security code works, is to run our tests with security manager enabled...
[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval
thongnt99 commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254119695 @jtibshirani The query side is the same as the document side: a dictionary of terms and weights. To make it compatible with Lucene, people just repeat each term according to its frequency. This is fine because queries are usually much shorter. Yes, FeatureField is something similar, but we want a single Field containing a list of key-value pairs or a JSON-formatted string. @msokolov @rmuir @mocobeta: I found [this](https://github.com/apache/lucene/blob/475fbd0bdde31c6a2ae62c59505cf9e8becd50e4/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.java), which could somehow achieve what we want. But I think it is not so flexible: we need to turn the JSON file into a token stream formatted as: [..] ... I think this step is redundant. Can we just load the JSON file directly? For this I think we might have to move away from the TokenStream pipeline. What do you think? Your thoughts are very much appreciated as I am not very familiar with Lucene. We can form a group to do this if you are interested.
[GitHub] [lucene] stevenschlansker commented on issue #11801: Remove usage of SecurityManager and AccessController
stevenschlansker commented on issue #11801: URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254120085 > for the situation of being a library and needing to support apps that still rely on securitymanager, I don't see any immediate fix. because the only way to know the security code works, is to run our tests with security manager enabled... Yes, this is going to be a challenge. Some apps will want to be on Java Latest, which will not even have the types defined. Other apps will still run on Java 8, even 20 years later ;) , and supporting both will be tricky.
[GitHub] [lucene] gautamworah96 commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
gautamworah96 commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r976900966 ## lucene/core/src/java/org/apache/lucene/store/WriteAmplificationTrackingDirectoryWrapper.java: ## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** {@link FilterDirectory} that tracks write amplification factor */ +public final class WriteAmplificationTrackingDirectoryWrapper extends FilterDirectory { + + private final AtomicLong flushedBytes = new AtomicLong(); + private final AtomicLong mergedBytes = new AtomicLong(); + + /** + * Sole constructor, typically called from sub-classes. + * + * @param in input Directory + */ + public WriteAmplificationTrackingDirectoryWrapper(Directory in) { +super(in); + } + + @Override + public IndexOutput createOutput(String name, IOContext context) throws IOException { +IndexOutput output = in.createOutput(name, context); +IndexOutput byteTrackingIndexOutput; +if (context.context.equals(IOContext.Context.FLUSH)) { Review Comment: `context.context` is a bit confusing. Lets rename the method param? 
[GitHub] [lucene] rmuir commented on issue #11801: Remove usage of SecurityManager and AccessController
rmuir commented on issue #11801: URL: https://github.com/apache/lucene/issues/11801#issuecomment-1254139967 I'm not worried, according to the JEP: https://openjdk.org/jeps/411 ``` In feature releases after Java 18, we will degrade other Security Manager APIs so that they remain in place but with limited or no functionality. For example, we may revise AccessController::doPrivileged simply to run the given action, or revise System::getSecurityManager always to return null. This will allow libraries that support the Security Manager and were compiled against previous Java releases to continue to work without change or even recompilation. We expect to remove the APIs once the compatibility risk of doing so declines to an acceptable level. ``` So it seems these APIs will become "no-ops" first.
[GitHub] [lucene] kotman12 opened a new pull request, #11802: fix sentence iteration in opennlp package
kotman12 opened a new pull request, #11802: URL: https://github.com/apache/lucene/pull/11802 Fix sentence boundary detection bug in case of repeating tokens (i.e. while using OpenNLP analysis chain in conjunction with a KeywordRepeatFilter) by keeping track of the sentence index and looking ahead one token. Move inner sentence iteration to a component to be shared by the sentence-aware OpenNLP filters.
[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug
kotman12 commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254253617 @dweiss I updated CHANGES.txt but blew up this PR and messed up the history in the process. If you prefer, here is a more concise PR with the relevant changes patched in -> https://github.com/apache/lucene/pull/11802
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977019915 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; + + protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) { +super( +"Byte tracking wrapper for: " + output.getName(), +"ByteTrackingIndexOutput{" + output.getName() + "}"); +this.output = output; +this.byteTracker = byteTracker; + } + + @Override + public void writeByte(byte b) throws IOException { +byteTracker.incrementAndGet(); +output.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) throws IOException { +byteTracker.addAndGet(length); +output.writeBytes(b, offset, length); + } + + @Override + public void close() throws IOException { +output.close(); + } + + @Override + public long getFilePointer() { +return output.getFilePointer(); + } + + @Override + public long getChecksum() throws IOException { +return output.getChecksum(); + } + + public String getWrappedName() { Review Comment: `super#getName()` and `super#toString()` would give us the name and String representation of this class. I made these in case someone wants the access the wrapped class's name and String representation. With that being said, I'm not really sure how useful these two methods would be. I could remove them? ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; + + protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) { +super( +"Byte tracking wrapper for: " + output.getName(), +"ByteTrackingIndexOutput{" + output.getName() + "}"); +this.output = output; +this.byteTracker = byteTracker; + } + + @Override + public void writeByte(byte b) throws IOException { +byteTracker.incrementAndGet(); Review Comment: Unfortunately, there is no pure `increment()` method: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html ## lucene/core/src/java/org/apache/lucene/store/WriteAmplificationTrackingDirectoryWrapper.java: ## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this fil
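As a side note on the `increment` exchange above: `AtomicLong` only offers value-returning variants such as `incrementAndGet()` and `addAndGet(long)`, while `java.util.concurrent.atomic.LongAdder` does provide void `increment()` and `add(long)`. The sketch below contrasts the two for a byte counter; it is not code from the PR, and the class name `ByteCounterSketch`, its methods, and the `writeLengths` input are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical comparison of two JDK counters for tracking bytes written. */
final class ByteCounterSketch {

  /** Counts bytes with AtomicLong; the returned values are simply ignored. */
  static long countWithAtomicLong(int[] writeLengths) {
    AtomicLong counter = new AtomicLong();
    for (int length : writeLengths) {
      if (length == 1) {
        counter.incrementAndGet(); // single-byte write
      } else {
        counter.addAndGet(length); // bulk write
      }
    }
    return counter.get();
  }

  /** Counts bytes with LongAdder, whose increment() and add() return void. */
  static long countWithLongAdder(int[] writeLengths) {
    LongAdder counter = new LongAdder();
    for (int length : writeLengths) {
      if (length == 1) {
        counter.increment();
      } else {
        counter.add(length);
      }
    }
    return counter.sum();
  }
}
```

Both variants produce the same total. `LongAdder` is generally aimed at counters that are updated often and read rarely; whether that matters for this wrapper is not something the thread establishes.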
[GitHub] [lucene] gsmiller opened a new pull request, #11803: DrillSideways optimizations
gsmiller opened a new pull request, #11803: URL: https://github.com/apache/lucene/pull/11803 ### Description This change makes use of `advance` instead of `next` where possible and splits out 1st and 2nd phase checking to avoid match confirmation when unnecessary. Note that I only focused on the `doQueryFirstScoring` implementation here and didn't modify the other two scoring approaches. "Progress not perfection" and all that (plus, I think we should strongly consider removing these other two implementations, but we'd want to benchmark to be certain). Unfortunately, `luceneutil` doesn't have dedicated drill sideways benchmarks, but some benchmarks on our internal software that makes use of drill sideways showed a +2% QPS improvement and no obvious regressions.
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
wjp719 commented on PR #687: URL: https://github.com/apache/lucene/pull/687#issuecomment-1254416217 > Thanks, this looks good to me! Can you add a CHANGES entry with your name under 9.5? Thanks a lot, I have added the CHANGES entry. This PR has a limitation: only index sorting in ascending order can use BKD binary search to get the min/max docId. The main reason is that BKD currently sorts points by two dimensions (point value, docId), both in ascending order. If docs are index-sorted in ascending order, then the docIds of all leaf points are monotonically increasing, so we can use BKD binary search. In our local work, if docs are index-sorted in descending order, we change the BKD sorting logic to (point value in ascending order, docId in descending order), so that the docIds of all leaf points are monotonically decreasing and we can use BKD binary search again. So may I open another PR to add an option for BKD to sort by (point value in ascending order, docId in descending order)? Then the BKD binary search would work with both ascending and descending index sorting.
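The monotonicity argument above can be sketched on a flat array: when the index is sorted ascending on the field, the value at each docId is non-decreasing, so the matching docId range for a value range can be found with two binary searches. This is a simplified analogue, not the BKD tree traversal the PR implements; the class `SortedRangeSketch`, its methods, and the `valuesByDocId` array are hypothetical.

```java
/** Hypothetical flat-array analogue of range lookup on an ascending index sort. */
final class SortedRangeSketch {

  /** Returns the first index whose value is >= target, or values.length if none. */
  static int lowerBound(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /**
   * Returns {firstDoc, lastDocExclusive} for docs whose value lies in [lower, upper].
   * Assumes valuesByDocId is non-decreasing, i.e. the index is sorted ascending.
   */
  static int[] docRange(long[] valuesByDocId, long lower, long upper) {
    int from = lowerBound(valuesByDocId, lower);
    // Guard against overflow when computing upper + 1.
    int to =
        upper == Long.MAX_VALUE
            ? valuesByDocId.length
            : lowerBound(valuesByDocId, upper + 1);
    return new int[] {from, to};
  }
}
```

With a descending index sort the same array would be non-increasing, which is why the comment proposes changing the docId tie-break order so that a mirrored search remains possible.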
[GitHub] [lucene-solr] risdenk commented on pull request #2670: Backport a few upgrades to branch_8_11
risdenk commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1254471773 Took a few runs but got a pass: ``` BUILD SUCCESSFUL Total time: 75 minutes 41 seconds ``` all the failures didn't reproduce when run independently so I don't think they had anything to do with these fixes.
[GitHub] [lucene-solr] risdenk merged pull request #2670: Backport a few upgrades to branch_8_11
risdenk merged PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670
[GitHub] [lucene] jpountz commented on pull request #11793: Prevent PointValues from returning null for ghost fields
jpountz commented on PR #11793: URL: https://github.com/apache/lucene/pull/11793#issuecomment-1254593758 Test failures suggest CheckIndex needs to have its expectations adjusted.
[GitHub] [lucene] jpountz commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
jpountz commented on PR #687: URL: https://github.com/apache/lucene/pull/687#issuecomment-1254600722 I was wondering about descending sorts too! Do we actually need to make this configurable on BKD trees? I would rather not add this option and instead make the binary search logic a bit more complex/inefficient.
[GitHub] [lucene] jpountz merged pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
jpountz merged PR #687: URL: https://github.com/apache/lucene/pull/687