[GitHub] [lucene] rmuir commented on issue #11869: Add RangeOnRangeFacetCounts
rmuir commented on issue #11869: URL: https://github.com/apache/lucene/issues/11869#issuecomment-1385822481

Closing as the PR has been merged and is in the 9.5.0 section of CHANGES.txt.
[GitHub] [lucene] rmuir commented on issue #11795: Add FilterDirectory to track write amplification factor
rmuir commented on issue #11795: URL: https://github.com/apache/lucene/issues/11795#issuecomment-1385823162

Closing as the PR has been merged and is in the 9.5.0 section of CHANGES.txt.
[GitHub] [lucene] rmuir closed issue #11795: Add FilterDirectory to track write amplification factor
rmuir closed issue #11795: Add FilterDirectory to track write amplification factor URL: https://github.com/apache/lucene/issues/11795
[GitHub] [lucene] rmuir closed issue #11869: Add RangeOnRangeFacetCounts
rmuir closed issue #11869: Add RangeOnRangeFacetCounts URL: https://github.com/apache/lucene/issues/11869
[GitHub] [lucene] gsmiller opened a new pull request, #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller opened a new pull request, #12089: URL: https://github.com/apache/lucene/pull/12089

### Description

This is a DRAFT PR to sketch out the idea of a "self optimizing" TermInSetQuery. The idea is to build on the new `KeywordField` being proposed in #12054, which indexes both postings and DV data. It takes a bit of a different approach compared to `IndexOrDocValuesQuery`, though, by "internally" deciding whether to use postings vs. doc values (at segment granularity).

Please note that there are many TODOs in here and I haven't done any benchmarking yet. I've written light tests to convince myself it works (I've made sure all branches have been exercised), but it's highly likely there are bugs. I'm putting this out there for discussion only. My plan is to benchmark this approach as a next step, but I wanted to float the idea early to see if anyone has feedback or other ideas. Also, if someone loves the idea and wants to run with it, please go for it. I'm pretty busy for the next couple of weeks and I'm not sure when I'll come back to this.
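To make the idea concrete, here is a minimal sketch of the kind of per-segment choice being described. This is not the PR's code: the helper class, the docFreq-based cost heuristic, and the lead-cost comparison are illustrative assumptions, and a real implementation would live inside the query's `ScorerSupplier`.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical per-segment heuristic: walk postings when the summed docFreq of the
 * query terms is small relative to the lead cost, otherwise post-filter via doc values.
 */
final class PostingsVsDocValuesHeuristic {

  static boolean usePostings(LeafReader reader, String field, BytesRef[] queryTerms, long leadCost)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return false; // no postings indexed for this field in this segment
    }
    long postingsCost = 0;
    TermsEnum termsEnum = terms.iterator();
    for (BytesRef term : queryTerms) {
      if (termsEnum.seekExact(term)) {
        postingsCost += termsEnum.docFreq();
      }
    }
    // If this clause would drive iteration anyway, walking postings is cheap;
    // if another clause is far more selective, checking doc values per candidate doc wins.
    return postingsCost <= leadCost;
  }
}
```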
[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.
gsmiller commented on PR #12054: URL: https://github.com/apache/lucene/pull/12054#issuecomment-1385952712

Somewhat related to this PR, I've been experimenting with the idea of a "self optimizing" `TermInSetQuery` implementation that toggles between using postings and doc values based on index statistics, etc. I wanted to link that idea here as it's a bit related (it requires indexing both postings and dv, which this PR makes easy). This is just an early idea, but I'll link an early draft here in case anyone is curious or has thoughts: #12089
[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1385954675

Thanks for looking at this. I can alter the benchmark from #12087 to test this case; honestly we could even just take that benchmark and index the numeric field as a string instead as a start :)

In the case of numeric fields, we just had a crazy query in the sandbox which is better as e.g. NumericDocValues.newSlowSetQuery. And then we hooked into IntField etc. as newSlowSetQuery with the IndexOrDocValuesQuery. For that one, we had to fix PointInSetQuery to support ScorerSupplier etc. (but TermInSetQuery already has this cost estimation).

I was naively thinking to try the same approach with the DocValuesTermsQuery that is in the sandbox... though I anticipated maybe more trickiness with the inverted index as opposed to points. But maybe IndexOrDocValuesQuery would surprise me again; of course it's probably worth exploring anyway, and we could compare the approaches. I do like IndexOrDocValuesQuery for solving these problems, and if we can improve it and keep it generic, I'd definitely be in favor of that. But whatever is fastest wins :)

I do think it's important to add fields such as KeywordField and put this best-practice logic behind simple methods so that it is easier on the user.
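For reference, the `IndexOrDocValuesQuery` route described above might look roughly like the following, pairing the core `TermInSetQuery` (postings) with the sandbox `DocValuesTermsQuery` (doc values). The field name and terms are made up for illustration; this is a sketch rather than a benchmarked recommendation.

```java
import java.util.List;
import org.apache.lucene.sandbox.search.DocValuesTermsQuery;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

final class TermSetFilterExample {

  static Query colorFilter() {
    List<BytesRef> terms =
        List.of(new BytesRef("red"), new BytesRef("green"), new BytesRef("blue"));
    // Postings-based execution: good when this filter drives iteration.
    Query indexQuery = new TermInSetQuery("color", terms);
    // Doc-values-based execution: good as a post-filter when another clause is more selective.
    Query dvQuery = new DocValuesTermsQuery("color", terms);
    // IndexSearcher picks whichever side is cheaper per segment using ScorerSupplier cost estimates.
    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
  }
}
```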
[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1385982331

I modified the benchmark from #12087 to just use StringField instead of IntField. The queries are supposed to be "hard" in that I'm not trying to benchmark what is necessarily typical; instead I target "hard" stuff that is more worst-case (e.g. we shouldn't cause regressions vs `new PointInSetQuery()` in the main branch today): [StringSetBenchmark.java.txt](https://github.com/apache/lucene/files/10438949/StringSetBenchmark.java.txt)
[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
rmuir commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072830614

## lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java: @@ -0,0 +1,527 @@

+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment: but being unsorted is really bad for performance: it means we do a bunch of random-access lookups in the terms dictionaries. Looping over these unsorted terms and doing seekExact, and looping over these unsorted terms and doing lookupOrd, could instead easily be sequential/more friendly. Given that sometimes looking up all the terms is a very heavy cost for this thing, calling `Arrays.sort()` in the ctor seems like an easy win/no-brainer, as does the prefix-coding to keep the RAM usage lower in the worst case.
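A minimal sketch of what that could look like in the constructor, borrowing the `PrefixCodedTerms` approach from the existing core `TermInSetQuery`; this is an illustrative assumption about how the sandbox class might adopt the suggestion, not code from the PR.

```java
import java.util.Arrays;
import java.util.Collection;
import org.apache.lucene.index.PrefixCodedTerms;
import org.apache.lucene.util.BytesRef;

final class SortedPrefixCodedTermsExample {

  // Sort once up front so terms-dictionary seeks and ordinal lookups are sequential,
  // then prefix-code the sorted terms to keep worst-case RAM usage down.
  static PrefixCodedTerms encodeTerms(String field, Collection<BytesRef> terms) {
    BytesRef[] sorted = terms.toArray(new BytesRef[0]);
    Arrays.sort(sorted); // BytesRef compares in unsigned byte order
    PrefixCodedTerms.Builder builder = new PrefixCodedTerms.Builder();
    BytesRef previous = null;
    for (BytesRef term : sorted) {
      if (previous == null || previous.equals(term) == false) { // skip adjacent duplicates
        builder.add(field, term);
      }
      previous = term;
    }
    return builder.finish();
  }
}
```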
[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072835306

## lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java: @@ -0,0 +1,527 @@

Review Comment: I generally agree. I was testing some interesting use-cases though with our current `TermInSetQuery` where we have 10,000+ terms (PK-type field), and the sorting it does actually came with a significant cost (and we got a pretty good win by removing it). But that's a bit of a different use-case maybe now that I think about it. We have a bloom filter in place as well, and a good share of the terms aren't actually in the index. So there's that... OK, yeah, we probably ought to sort here in a general implementation :)
[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1386068784

@rmuir

> I was naively thinking to try the same approach with the DocValuesTermsQuery that is in the sandbox...

I think that's probably a good place to start, honestly. I was thinking of introducing this in the sandbox module initially and not actually hooking it into `KeywordField`. Then maybe following up by graduating it and hooking it in? But maybe it's worth doing initially if we can get it right? I dunno.
[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
rmuir commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072841550

## lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java: @@ -0,0 +1,527 @@

Review Comment: When I think of the worst case, I'm assuming these term dictionaries don't even fit in RAM. We should really always assume this stuff doesn't fit in RAM :)
[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
rmuir commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072855208

## lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java: @@ -0,0 +1,527 @@

Review Comment: you can use this tool when benchmarking to help make sure the index no longer fits in RAM: https://github.com/mikemccand/luceneutil/blob/b48e7f49b19c27367737436214cc1ce7e67ad32c/src/python/ramhog.c

or you can open up the computer and remove DIMMs
[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072871141

## lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java: @@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) {

       matchCost = 0f;
     }
   }
+
+  public DisiWrapper(DocIdSetIterator iterator) {

Review Comment: This change is common to #12055, so I'm hoping we'd actually land it as part of that work rather than needing it just for this.
[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072872477

## lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java: @@ -0,0 +1,527 @@

Review Comment: That's a good point/perspective. I'm convinced. It's easy enough to borrow those ideas from our existing `TermInSetQuery`, so I'll do that. Thanks.
[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072874867

## lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java: @@ -0,0 +1,527 @@

+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;
+  private final int termsHashCode;
+
+  public TermInSetQuery(String field, Collection<BytesRef> terms) {
+    this.field = field;
+
+    final Set<BytesRef> uniqueTerms;
+    if (terms instanceof Set) {
+      uniqueTerms = (Set<BytesRef>) terms;
+    } else {
+      uniqueTerms = new HashSet<>(terms);
+    }
+    this.terms = new BytesRef[uniqueTerms.size()];
+    Iterator<BytesRef> it = uniqueTerms.iterator();
+    for (int i = 0; i < uniqueTerms.size(); i++) {
+      assert it.hasNext();
+      this.terms[i] = it.next();
+    }
+    // TODO: compute lazily?
+    termsHashCode = Arrays.hashCode(this.terms);
+  }
+
+  @Override
+  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
+      throws IOException {
+
+    return new ConstantScoreWeight(this, boost) {
+
+      @Override
+      public Scorer scorer(LeafReaderContext context) throws IOException {
+        ScorerSupplier supplier = scorerSupplier(context);
+        if (supplier == null) {
+          return null;
+        } else {
+          return supplier.get(Long.MAX_VALUE);
+        }
+      }
+
+      @Override
+      public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
+        if (terms.length <= 1) {
+          throw new IllegalStateException("Must call IndexSearcher#rewrite");
+        }
+
+        // If the field doesn't exist in the segment, return null:
+        LeafReader reader = context.reader();
+        FieldInfo fi = reader.getFieldInfos().fieldInfo(field);
+        if (fi == null) {
+          return null;
+        }
+
+        return
[GitHub] [lucene] mulugetam opened a new issue, #12090: Building a Lucene posting format that leverages the Java Vector API
mulugetam opened a new issue, #12090: URL: https://github.com/apache/lucene/issues/12090

### Description

This issue is to start a conversation on implementing a vectorized encoding and decoding scheme for postings. A few months ago, we implemented vectorized integer compression based on the [JavaFastPFOR](https://github.com/lemire/JavaFastPFOR) library. That code has since been [merged](https://github.com/lemire/JavaFastPFOR/pull/51). Performance results, based on [JMH](https://github.com/openjdk/jmh), show [significant gains](https://github.com/mulugetam/VectorJavaFastPFOR) over the default JavaFastPFOR.

We would, of course, need to benchmark a vectorized PostingsFormat against the existing implementation. @jpountz What's your take on it, and how should we go about it?
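For context, the JavaFastPFOR codec API that the linked benchmarks exercise is roughly as below; this is a hedged sketch of the library's standard usage (package `me.lemire.integercompression`), and a Lucene PostingsFormat would wrap something like it around its blocks of doc-id deltas. The vectorized codec from the merged PR would presumably plug in behind the same `IntegerCODEC` interface.

```java
import me.lemire.integercompression.Composition;
import me.lemire.integercompression.FastPFOR;
import me.lemire.integercompression.IntWrapper;
import me.lemire.integercompression.IntegerCODEC;
import me.lemire.integercompression.VariableByte;

final class PostingsBlockCompressionExample {

  static int[] roundTrip(int[] docDeltas) {
    // FastPFOR compresses full blocks; VariableByte handles the tail of the input.
    IntegerCODEC codec = new Composition(new FastPFOR(), new VariableByte());

    int[] compressed = new int[docDeltas.length + 1024]; // generous worst-case headroom
    IntWrapper inPos = new IntWrapper(0);
    IntWrapper outPos = new IntWrapper(0);
    codec.compress(docDeltas, inPos, docDeltas.length, compressed, outPos);
    int compressedInts = outPos.get();

    int[] restored = new int[docDeltas.length];
    codec.uncompress(compressed, new IntWrapper(0), compressedInts, restored, new IntWrapper(0));
    return restored;
  }
}
```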
[GitHub] [lucene] mulugetam opened a new issue, #12091: Speeding up Lucene Vector Similarity through the Java Vector API
mulugetam opened a new issue, #12091: URL: https://github.com/apache/lucene/issues/12091

### Description

Lucene's implementation of ANN relies on a scalar implementation of the vector similarity functions [dot-product](https://github.com/apache/lucene/blob/4fe8424925ca404d335fa41d261545d3182c22fa/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java#L53), [Euclidean distance](https://github.com/apache/lucene/blob/4fe8424925ca404d335fa41d261545d3182c22fa/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java#L34), and [cosine](https://github.com/apache/lucene/blob/4fe8424925ca404d335fa41d261545d3182c22fa/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java#L71). The vector implementation of these functions is quite straightforward.

Below is performance data I got, based on JMH, comparing the vector implementation of the dot product and Euclidean distance against the equivalent default (scalar with loop-unrolling) implementation. `dim` is the dimension/length of the `float[]` arrays under test and `Score` is the number of dot product/Euclidean distance operations done per second.

```
Benchmark          dim   Mode  Cnt           Score                 Units   Gain
------------------------------------------------------------------------------
scalarDotProduct    60  thrpt   12    32031825.541 ±   6151.580    ops/s   1.00
scalarDotProduct   120  thrpt   12    17120537.911 ±   5793.505    ops/s   1.00
scalarDotProduct   480  thrpt   12     4506350.215 ±   1677.755    ops/s   1.00
vectorDotProduct    60  thrpt   12    98862701.038 ±  85554.695    ops/s   3.09
vectorDotProduct   120  thrpt   12    99059913.888 ±  20609.182    ops/s   5.79
vectorDotProduct   480  thrpt   12   220320941.436 ± 173467.603    ops/s  48.89
```

```
Benchmark              dim   Mode  Cnt           Score                Units     Gain
------------------------------------------------------------------------------------
scalarSquareDistance    60  thrpt   12    25890614.822 ±  7071.413    ops/s     1.00
scalarSquareDistance   120  thrpt   12    12524294.760 ±  3435.882    ops/s     1.00
scalarSquareDistance   480  thrpt   12     3145045.026 ±   409.361    ops/s     1.00
vectorSquareDistance    60  thrpt   12   104317302.765 ± 36895.474    ops/s     4.03
vectorSquareDistance   120  thrpt   12   122083614.889 ± 11821.642    ops/s     9.75
vectorSquareDistance   480  thrpt   12   362229408.898 ± 85439.065    ops/s   115.17
```

I have also tested the same with [Msokolov's ANN benchmark suite](https://github.com/msokolov/ann-benchmarks) and saw a speedup of more than 2x in indexing (docs/sec) and search performance (QPS). Will do a PR for it soon.

Let's discuss this :-)
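To illustrate the kind of change being proposed, here is a minimal Panama Vector API dot product (a sketch using `jdk.incubator.vector`, which requires `--add-modules jdk.incubator.vector`); this is not the code from the forthcoming PR, just the straightforward vectorization the issue alludes to.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorizedDotProduct {

  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float dotProduct(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    int bound = SPECIES.loopBound(a.length);
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc); // lane-wise acc += va * vb
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) { // scalar tail for the remaining elements
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```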
[GitHub] [lucene] jebnix commented on issue #11870: Create a Markdown based documentation
jebnix commented on issue #11870: URL: https://github.com/apache/lucene/issues/11870#issuecomment-1386297416

@uschindler That's nice, but I personally miss two things in the Lucene repo:

1. The ability to find the documentation in a central place, which makes contribution much easier; that's the way most repositories manage project documentation. Javadoc, in my opinion, is good for code-related notes, while the separate docs (currently inside `package-info.java`) belong in a `docs/` dir.
2. A generated **good-looking** documentation site. I suggest using Docusaurus, so the docs get written in Markdown and look like [this, for example](https://redux.js.org/).
[GitHub] [lucene] vigyasharma opened a new pull request, #12092: Remove UTF8TaxonomyWriterCache
vigyasharma opened a new pull request, #12092: URL: https://github.com/apache/lucene/pull/12092

As per the discussion in PR #12013, this change removes the never-evicting `UTF8TaxonomyWriterCache` and uses `LruTaxonomyWriterCache` as the default taxonomy writer cache implementation.

Addresses #12000
[GitHub] [lucene] vigyasharma commented on pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close()
vigyasharma commented on PR #12013: URL: https://github.com/apache/lucene/pull/12013#issuecomment-1386545577

Created a separate PR - #12092 to remove support for `UTF8TaxonomyWriterCache` from main. Will close this PR.
[GitHub] [lucene] vigyasharma closed pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close()
vigyasharma closed pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close() URL: https://github.com/apache/lucene/pull/12013
[GitHub] [lucene] vigyasharma merged pull request #12045: fix typo in KoreanNumberFilter
vigyasharma merged PR #12045: URL: https://github.com/apache/lucene/pull/12045
[GitHub] [lucene] vigyasharma opened a new pull request, #12093: Deprecate support for UTF8TaxonomyWriterCache
vigyasharma opened a new pull request, #12093: URL: https://github.com/apache/lucene/pull/12093

As discussed in PR #12013, this deprecates support for `UTF8TaxonomyWriterCache` in branch_9x.

Addresses #12000
[GitHub] [lucene] vigyasharma commented on pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close()
vigyasharma commented on PR #12013: URL: https://github.com/apache/lucene/pull/12013#issuecomment-1386565076

PR https://github.com/apache/lucene/pull/12093 deprecates `UTF8TaxonomyWriterCache` in 9.x.