[GitHub] [lucene] rmuir commented on issue #11869: Add RangeOnRangeFacetCounts

2023-01-17 Thread GitBox


rmuir commented on issue #11869:
URL: https://github.com/apache/lucene/issues/11869#issuecomment-1385822481

   Closing as the PR has been merged and is in the 9.5.0 section of CHANGES.txt





[GitHub] [lucene] rmuir commented on issue #11795: Add FilterDirectory to track write amplification factor

2023-01-17 Thread GitBox


rmuir commented on issue #11795:
URL: https://github.com/apache/lucene/issues/11795#issuecomment-1385823162

   Closing as the PR has been merged and is in the 9.5.0 section of CHANGES.txt





[GitHub] [lucene] rmuir closed issue #11795: Add FilterDirectory to track write amplification factor

2023-01-17 Thread GitBox


rmuir closed issue #11795: Add FilterDirectory to track write amplification 
factor
URL: https://github.com/apache/lucene/issues/11795





[GitHub] [lucene] rmuir closed issue #11869: Add RangeOnRangeFacetCounts

2023-01-17 Thread GitBox


rmuir closed issue #11869: Add RangeOnRangeFacetCounts
URL: https://github.com/apache/lucene/issues/11869





[GitHub] [lucene] gsmiller opened a new pull request, #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


gsmiller opened a new pull request, #12089:
URL: https://github.com/apache/lucene/pull/12089

   ### Description
   
   This is a DRAFT PR to sketch out the idea of a "self optimizing" TermInSetQuery. The idea is to build on the new `KeywordField` being proposed in #12054, which indexes both postings and DV data. It takes a bit of a different approach than `IndexOrDocValuesQuery`, though, by "internally" deciding whether to use postings vs. doc values (at segment granularity).
   
   Please note that there are many TODOs in here and I haven't done any 
benchmarking, etc. I've written light tests to convince myself it works (I've 
made sure all branches have been exercised), but it's highly likely there are 
bugs.
   
   I'm putting this out there for discussion only. My plan is to benchmark this approach as a next step, but I wanted to float the idea early to see if anyone has feedback or other ideas. Also, if someone loves the idea and wants to run with it, please go for it. I'm pretty busy for the next couple of weeks and I'm not sure when I'll come back to this.
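
   To make the idea concrete, here is a minimal, hypothetical sketch of the kind of per-segment decision this draft explores (the helper name and the cost heuristic are illustrative assumptions, not the PR's actual code):

   ```java
   import java.io.IOException;
   import org.apache.lucene.index.LeafReader;
   import org.apache.lucene.index.SortedSetDocValues;
   import org.apache.lucene.index.Terms;

   final class PostingsVsDocValuesHeuristic {
     // Decide, for a single segment, whether to drive matching off postings or off
     // SortedSetDocValues. The heuristic below is a stand-in for the PR's logic.
     static boolean preferPostings(LeafReader reader, String field, int numQueryTerms)
         throws IOException {
       Terms terms = reader.terms(field);
       SortedSetDocValues dv = reader.getSortedSetDocValues(field);
       if (terms == null) {
         return false; // no postings indexed for this field in this segment
       }
       if (dv == null) {
         return true; // no doc values; postings are the only option
       }
       // Rough postings cost: number of query terms times the average postings length.
       long termCount = Math.max(1, terms.size()); // size() may be -1 if unknown
       long expectedPostingsCost = numQueryTerms * (terms.getSumDocFreq() / termCount);
       // The doc-values side visits every document with a value and checks ordinals.
       return expectedPostingsCost < dv.cost();
     }
   }
   ```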
   





[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.

2023-01-17 Thread GitBox


gsmiller commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1385952712

   Somewhat related to this PR, I've been experimenting with the idea of a "self optimizing" `TermInSetQuery` implementation that toggles between using postings and doc values based on index statistics, etc. I wanted to mention it here since it's related (it requires indexing both postings and doc values, which this PR makes easy). It's just an early idea, but here's an early draft in case anyone is curious or has thoughts: #12089





[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1385954675

   Thanks for looking at this. I can alter the benchmark from #12087 to test this case; honestly, we could even just take that benchmark and index the numeric field as a string instead, as a start :)

   In the case of numeric fields, we just had a crazy query in the sandbox, which is better exposed as e.g. NumericDocValues.newSlowSetQuery. We then hooked it into IntField etc. as newSlowSetQuery via IndexOrDocValuesQuery. For that one, we had to fix PointInSetQuery to support ScorerSupplier etc. (but TermInSetQuery already has this cost estimation).

   I was naively thinking of trying the same approach with the DocValuesTermsQuery that is in the sandbox... though I anticipated maybe more trickiness with the inverted index as opposed to points. But maybe IndexOrDocValuesQuery would surprise me again; it's probably worth exploring anyway, and we could compare the approaches. I do like IndexOrDocValuesQuery for solving these problems, and if we can improve it to keep it generic, I'd definitely be in favor of that. But whatever is fastest wins :)

   I do think it's important to add fields such as KeywordField and put this best-practice logic behind simple methods so that it is easier on the user.
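
   For readers unfamiliar with the pattern described above, the wiring looks roughly like this (a hedged sketch; the field name and values are invented, and `newSlowSetQuery` is the doc-values-side factory referenced in the comment, not verified here):

   ```java
   import org.apache.lucene.document.IntPoint;
   import org.apache.lucene.document.SortedNumericDocValuesField;
   import org.apache.lucene.search.IndexOrDocValuesQuery;
   import org.apache.lucene.search.Query;

   final class NumericSetQueryExample {
     // Combine the points-based set query (fast when this clause leads iteration)
     // with the "slow" doc-values set query (cheap when another clause leads);
     // IndexOrDocValuesQuery picks per segment based on cost estimates.
     static Query newPriceSetQuery() {
       Query indexQuery = IntPoint.newSetQuery("price", 10, 20, 30);
       Query dvQuery = SortedNumericDocValuesField.newSlowSetQuery("price", 10, 20, 30);
       return new IndexOrDocValuesQuery(indexQuery, dvQuery);
     }
   }
   ```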





[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1385982331

   I modified the benchmark from #12087 to just use StringField instead of IntField. The queries are supposed to be "hard": I'm not trying to benchmark what is necessarily typical, but instead to target worst-case-ish stuff (e.g. we shouldn't cause regressions vs `new PointInSetQuery()` in the main branch today): [StringSetBenchmark.java.txt](https://github.com/apache/lucene/files/10438949/StringSetBenchmark.java.txt)
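
   For anyone reproducing this, the swap amounts to indexing the value as a keyword-style string with both postings and doc values, e.g. something like the following hypothetical snippet (field name invented, not the attached benchmark's code):

   ```java
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.document.SortedDocValuesField;
   import org.apache.lucene.document.StringField;
   import org.apache.lucene.util.BytesRef;

   final class StringSetIndexing {
     // Index the former numeric value as a string, with postings and doc values,
     // so both postings-based and doc-values-based set queries can be exercised.
     static void addValue(Document doc, int value) {
       String s = Integer.toString(value);
       doc.add(new StringField("id", s, Field.Store.NO));
       doc.add(new SortedDocValuesField("id", new BytesRef(s)));
     }
   }
   ```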
   





[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


rmuir commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072830614


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   but it is really bad for performance to leave these unsorted: it means we do a bunch of random-access lookups in the terms dictionaries, looping over these unsorted terms doing seekExact, or looping over them doing lookupOrd, when the lookups could instead easily be sequential/more friendly.

   Given that sometimes looking up all the terms is a very heavy cost for this thing, calling `Arrays.sort()` in the ctor seems like an easy win/no-brainer. So does the prefix-coding, to keep the RAM usage lower in the worst case.
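
   A rough illustration of the "sort once, then seek in order" point (a sketch under the same assumptions, not the PR's code):

   ```java
   import java.io.IOException;
   import java.util.Arrays;
   import org.apache.lucene.index.LeafReader;
   import org.apache.lucene.index.Terms;
   import org.apache.lucene.index.TermsEnum;
   import org.apache.lucene.util.BytesRef;

   final class SortedSeekSketch {
     // Sorting the query terms up front (BytesRef is Comparable) turns the
     // terms-dictionary lookups into a mostly forward-only scan instead of
     // a series of random-access seeks.
     static void visitMatchingTerms(LeafReader reader, String field, BytesRef[] queryTerms)
         throws IOException {
       Arrays.sort(queryTerms);
       Terms terms = reader.terms(field);
       if (terms == null) {
         return;
       }
       TermsEnum te = terms.iterator();
       for (BytesRef term : queryTerms) {
         if (te.seekExact(term)) {
           // the term exists in this segment; gather its postings / TermState here
         }
       }
     }
   }
   ```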






[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072835306


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   I generally agree. I was testing some interesting use-cases with our current `TermInSetQuery`, though, where we have 10,000+ terms (a PK-type field), and the sorting it does actually came with a significant cost (we got a pretty good win by removing it). But that's a bit of a different use-case, maybe, now that I think about it: we have a bloom filter in place as well, and a good share of the terms aren't actually in the index. So there's that...

   OK, yeah, we probably ought to sort here in a general implementation :)






[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1386068784

   @rmuir 
   > I was naively thinking of trying the same approach with the DocValuesTermsQuery that is in the sandbox...

   I think that's probably a good place to start, honestly. I was thinking of introducing this in the sandbox module initially and not actually hooking it into `KeywordField`, then maybe following up by graduating it and hooking it in. But maybe it's worth doing that right away if we can get it right? I dunno.





[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


rmuir commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072841550


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   When I think of the worst case, I'm assuming these term dictionaries don't even fit in RAM. We should really always assume this stuff doesn't fit in RAM :)






[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


rmuir commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072855208


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   you can use this tool when benchmarking to help make sure the index no longer fits in RAM: https://github.com/mikemccand/luceneutil/blob/b48e7f49b19c27367737436214cc1ce7e67ad32c/src/python/ramhog.c

   or you can open up the computer and remove DIMMs






[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072871141


##
lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java:
##
@@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) {
   matchCost = 0f;
 }
   }
+
+  public DisiWrapper(DocIdSetIterator iterator) {

Review Comment:
   This change is common to #12055, so I'm hoping we'd actually land it as part of that work rather than needing it just for this.






[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072872477


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   That's a good point/perspective. I'm convinced. It's easy enough to borrow 
those ideas from our existing `TermInSetQuery`, so I'll do that. Thanks.






[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-17 Thread GitBox


gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072874867


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;
+  private final int termsHashCode;
+
+  public TermInSetQuery(String field, Collection<BytesRef> terms) {
+    this.field = field;
+
+    final Set<BytesRef> uniqueTerms;
+    if (terms instanceof Set) {
+      uniqueTerms = (Set<BytesRef>) terms;
+    } else {
+      uniqueTerms = new HashSet<>(terms);
+    }
+    this.terms = new BytesRef[uniqueTerms.size()];
+    Iterator<BytesRef> it = uniqueTerms.iterator();
+    for (int i = 0; i < uniqueTerms.size(); i++) {
+      assert it.hasNext();
+      this.terms[i] = it.next();
+    }
+    // TODO: compute lazily?
+    termsHashCode = Arrays.hashCode(this.terms);
+  }
+
+  @Override
+  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
+      throws IOException {
+
+    return new ConstantScoreWeight(this, boost) {
+
+      @Override
+      public Scorer scorer(LeafReaderContext context) throws IOException {
+        ScorerSupplier supplier = scorerSupplier(context);
+        if (supplier == null) {
+          return null;
+        } else {
+          return supplier.get(Long.MAX_VALUE);
+        }
+      }
+
+      @Override
+      public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
+        if (terms.length <= 1) {
+          throw new IllegalStateException("Must call IndexSearcher#rewrite");
+        }
+
+        // If the field doesn't exist in the segment, return null:
+        LeafReader reader = context.reader();
+        FieldInfo fi = reader.getFieldInfos().fieldInfo(field);
+        if (fi == null) {
+          return null;
+        }
+
+        return

[GitHub] [lucene] mulugetam opened a new issue, #12090: Building a Lucene posting format that leverages the Java Vector API

2023-01-17 Thread GitBox


mulugetam opened a new issue, #12090:
URL: https://github.com/apache/lucene/issues/12090

   ### Description
   
   This issue is to start a conversation on implementing a vectorized encoding 
and decoding scheme for postings. 
   
   A few months ago, we implemented vectorized integer compression based on the [JavaFastPFOR](https://github.com/lemire/JavaFastPFOR) library. That code has since been [merged](https://github.com/lemire/JavaFastPFOR/pull/51). Performance results, based on [JMH](https://github.com/openjdk/jmh), show [significant gains](https://github.com/mulugetam/VectorJavaFastPFOR) over the default JavaFastPFOR.
   
   We would, of course, need to benchmark the vectorized PostingsFormat against 
the existing implementation.
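
   As a flavor of what the Vector API enables here, a minimal sketch of a lanewise frame-of-reference decode step (an assumption chosen for illustration, unrelated to the actual JavaFastPFOR kernels):

   ```java
   import jdk.incubator.vector.IntVector;
   import jdk.incubator.vector.VectorSpecies;

   // Requires --add-modules jdk.incubator.vector (the API is still incubating).
   final class ForDecodeSketch {
     private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

     // Restore frame-of-reference encoded ints: out[i] = base + deltas[i], lanewise.
     static void decode(int base, int[] deltas, int[] out) {
       int i = 0;
       int bound = SPECIES.loopBound(deltas.length);
       for (; i < bound; i += SPECIES.length()) {
         IntVector.fromArray(SPECIES, deltas, i).add(base).intoArray(out, i);
       }
       for (; i < deltas.length; i++) { // scalar tail
         out[i] = base + deltas[i];
       }
     }
   }
   ```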
   
   @jpountz  What's your take on it, and how should we go about it?





[GitHub] [lucene] mulugetam opened a new issue, #12091: Speeding up Lucene Vector Similarity through the Java Vector API

2023-01-17 Thread GitBox


mulugetam opened a new issue, #12091:
URL: https://github.com/apache/lucene/issues/12091

   ### Description
   
   Lucene's implementation of ANN relies on a scalar implementation of the vector similarity functions: [dot product](https://github.com/apache/lucene/blob/4fe8424925ca404d335fa41d261545d3182c22fa/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java#L53), [Euclidean distance](https://github.com/apache/lucene/blob/4fe8424925ca404d335fa41d261545d3182c22fa/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java#L34), and [cosine](https://github.com/apache/lucene/blob/4fe8424925ca404d335fa41d261545d3182c22fa/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java#L71). The vectorized implementation of these functions is quite straightforward.
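
   For context, the vectorized version is essentially a lanewise fused-multiply-add loop plus a scalar tail; a minimal sketch using the incubating Panama Vector API (not the exact code benchmarked below):

   ```java
   import jdk.incubator.vector.FloatVector;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;

   // Requires --add-modules jdk.incubator.vector.
   final class DotProductSketch {
     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

     static float dotProduct(float[] a, float[] b) {
       FloatVector acc = FloatVector.zero(SPECIES);
       int i = 0;
       int bound = SPECIES.loopBound(a.length);
       for (; i < bound; i += SPECIES.length()) {
         FloatVector va = FloatVector.fromArray(SPECIES, a, i);
         FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
         acc = va.fma(vb, acc); // per-lane a * b + acc
       }
       float sum = acc.reduceLanes(VectorOperators.ADD);
       for (; i < a.length; i++) { // scalar tail
         sum += a[i] * b[i];
       }
       return sum;
     }
   }
   ```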
   
   Below is performance data I got, based on JMH, comparing the vector 
implementation of the `dot product` and `Euclidean` against the equivalent 
default (scalar with loop-unrolling) implementation. 
   
   `dim` is the dimension/length of the `float[]` arrays under test, and `Score` is the number of dot-product/Euclidean-distance operations per second.
   
   ```
   Benchmark          dim   Mode  Cnt          Score        Error  Units    Gain
   ------------------------------------------------------------------------------
   scalarDotProduct    60  thrpt   12   32031825.541 ±   6151.580  ops/s    1.00
   scalarDotProduct   120  thrpt   12   17120537.911 ±   5793.505  ops/s    1.00
   scalarDotProduct   480  thrpt   12    4506350.215 ±   1677.755  ops/s    1.00
   vectorDotProduct    60  thrpt   12   98862701.038 ±  85554.695  ops/s    3.09
   vectorDotProduct   120  thrpt   12   99059913.888 ±  20609.182  ops/s    5.79
   vectorDotProduct   480  thrpt   12  220320941.436 ± 173467.603  ops/s   48.89
   ```

   ```
   Benchmark              dim   Mode  Cnt          Score        Error  Units     Gain
   ----------------------------------------------------------------------------------
   scalarSquareDistance    60  thrpt   12   25890614.822 ±   7071.413  ops/s     1.00
   scalarSquareDistance   120  thrpt   12   12524294.760 ±   3435.882  ops/s     1.00
   scalarSquareDistance   480  thrpt   12    3145045.026 ±    409.361  ops/s     1.00
   vectorSquareDistance    60  thrpt   12  104317302.765 ±  36895.474  ops/s     4.03
   vectorSquareDistance   120  thrpt   12  122083614.889 ±  11821.642  ops/s     9.75
   vectorSquareDistance   480  thrpt   12  362229408.898 ±  85439.065  ops/s   115.17
   ```
   
   I have also tested the same with [Msokolov's ANN benchmark 
suite](https://github.com/msokolov/ann-benchmarks) and saw a speedup of more 
than 2x in indexing (docs/sec) and search performance (QPS). Will do a PR for 
it soon.
   
   Let's discuss this :-)





[GitHub] [lucene] jebnix commented on issue #11870: Create a Markdown based documentation

2023-01-17 Thread GitBox


jebnix commented on issue #11870:
URL: https://github.com/apache/lucene/issues/11870#issuecomment-1386297416

   @uschindler That's nice, but I personally miss two things in the Lucene repo:
   1. The ability to find the documentation in a central place (which makes contributing much easier); that's how most repositories manage project documentation. In my opinion, Javadoc is good for code-related notes, while the separate docs (currently inside `package-info.java`) belong in a `docs/` dir.
   2. A generated, **good-looking** documentation site. I suggest using Docusaurus, so the docs would be written in Markdown and look like [this, for example](https://redux.js.org/).





[GitHub] [lucene] vigyasharma opened a new pull request, #12092: Remove UTF8TaxonomyWriterCache

2023-01-17 Thread GitBox


vigyasharma opened a new pull request, #12092:
URL: https://github.com/apache/lucene/pull/12092

   As per the discussion in PR #12013, this change removes the never-evicting `UTF8TaxonomyWriterCache` and uses `LruTaxonomyWriterCache` as the default taxonomy writer cache implementation.
   
   Addresses #12000 
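
   For anyone who was relying on the old default, the cache can still be chosen explicitly when constructing the taxonomy writer; a hedged sketch (cache size value invented):

   ```java
   import java.io.IOException;
   import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
   import org.apache.lucene.facet.taxonomy.writercache.LruTaxonomyWriterCache;
   import org.apache.lucene.index.IndexWriterConfig.OpenMode;
   import org.apache.lucene.store.Directory;

   final class TaxoWriterCacheExample {
     // Pass an explicit LruTaxonomyWriterCache instead of relying on the default.
     static DirectoryTaxonomyWriter open(Directory taxoDir) throws IOException {
       return new DirectoryTaxonomyWriter(
           taxoDir, OpenMode.CREATE_OR_APPEND, new LruTaxonomyWriterCache(4096));
     }
   }
   ```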





[GitHub] [lucene] vigyasharma commented on pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close()

2023-01-17 Thread GitBox


vigyasharma commented on PR #12013:
URL: https://github.com/apache/lucene/pull/12013#issuecomment-1386545577

   Created a separate PR, #12092, to remove support for `UTF8TaxonomyWriterCache` from main. Will close this PR.





[GitHub] [lucene] vigyasharma closed pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close()

2023-01-17 Thread GitBox


vigyasharma closed pull request #12013: Clear thread local values on 
UTF8TaxonomyWriterCache.close()
URL: https://github.com/apache/lucene/pull/12013





[GitHub] [lucene] vigyasharma merged pull request #12045: fix typo in KoreanNumberFilter

2023-01-17 Thread GitBox


vigyasharma merged PR #12045:
URL: https://github.com/apache/lucene/pull/12045





[GitHub] [lucene] vigyasharma opened a new pull request, #12093: Deprecate support for UTF8TaxonomyWriterCache

2023-01-17 Thread GitBox


vigyasharma opened a new pull request, #12093:
URL: https://github.com/apache/lucene/pull/12093

   As discussed in PR #12013, this deprecates support for `UTF8TaxonomyWriterCache` in branch_9x.
   Addresses #12000 





[GitHub] [lucene] vigyasharma commented on pull request #12013: Clear thread local values on UTF8TaxonomyWriterCache.close()

2023-01-17 Thread GitBox


vigyasharma commented on PR #12013:
URL: https://github.com/apache/lucene/pull/12013#issuecomment-1386565076

   Opened PR https://github.com/apache/lucene/pull/12093 to deprecate `UTF8TaxonomyWriterCache` in 9.x.
   

