[jira] [Commented] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries

2022-06-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559038#comment-17559038
 ] 

ASF subversion and git services commented on LUCENE-10606:
--

Commit e055e95d3ef404368c1accea95d88f1cf5b48c80 in lucene's branch 
refs/heads/branch_9x from Kaival Parikh
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e055e95d3ef ]

LUCENE-10606: For KnnVectorQuery, optimize case where filter is backed by 
BitSetIterator (#951)

Instead of collecting hit-by-hit using a `LeafCollector`, we break down the
search by instantiating a weight, creating scorers, and checking the underlying
iterator. If it is backed by a `BitSet`, we directly update the reference (as
we won't be editing the `Bits`). Else we can create a new `BitSet` from the
iterator using `BitSet.of`.
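A condensed sketch of that logic (based on the PR #951 diff quoted later in this digest; wrapped in a hypothetical helper for illustration):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.*;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.Bits;

final class FilterBits {
  // Reuse a cached BitSet when the filter's iterator is a BitSetIterator,
  // else materialize the iterator into a new BitSet once, up front.
  static Bits acceptDocs(Weight filterWeight, LeafReaderContext ctx) throws IOException {
    Scorer scorer = filterWeight.scorer(ctx);
    if (scorer == null) {
      return null; // no documents match the filter in this segment
    }
    DocIdSetIterator iterator = scorer.iterator();
    if (iterator instanceof BitSetIterator) {
      // Read-only reference to the cached bits; nothing is copied or edited.
      return ((BitSetIterator) iterator).getBitSet();
    }
    return BitSet.of(iterator, ctx.reader().maxDoc());
  }
}
{code}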


> Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed 
> queries
> 
>
> Key: LUCENE-10606
> URL: https://issues.apache.org/jira/browse/LUCENE-10606
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Kaival Parikh
>Priority: Minor
>  Labels: performance
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> While working on this [PR|https://github.com/apache/lucene/pull/932] to add 
> prefilter testing support, we saw that hit collection took a long time for 
> BitSetIterator backed scorers (due to iteration over the entire underlying 
> BitSet, and copying it into an internal one).
> These BitSetIterators can be frequent (as they are used in LRUQueryCache), 
> and bulk collection can be optimized with more knowledge of the underlying 
> iterator.






[jira] [Commented] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries

2022-06-27 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559039#comment-17559039
 ] 

Julie Tibshirani commented on LUCENE-10606:
---

I'm closing this out since we added a basic optimization for this case. We can 
expand on it in future PRs/issues.







[jira] [Resolved] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries

2022-06-27 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-10606.
---
Resolution: Fixed







[jira] [Updated] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries

2022-06-27 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10606:
--
Fix Version/s: 9.3







[GitHub] [lucene-solr] iverase merged pull request #2664: [8.11] Backport - LUCENE-9580: Don't introduce collinear edges when splitting polygon

2022-06-27 Thread GitBox


iverase merged PR #2664:
URL: https://github.com/apache/lucene-solr/pull/2664





[jira] [Commented] (LUCENE-9580) Tessellator failure for a certain polygon

2022-06-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559043#comment-17559043
 ] 

ASF subversion and git services commented on LUCENE-9580:
-

Commit 6a3f50539587cdabe5efe199bc06f6375f1d092a in lucene-solr's branch 
refs/heads/branch_8_11 from Hugo Mercier
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6a3f5053958 ]

LUCENE-9580: Fix bug in the polygon tessellator when introducing collinear 
edges during polygon splitting (#2452) (#2664)

Co-authored-by: Ignacio Vera 

> Tessellator failure for a certain polygon
> -
>
> Key: LUCENE-9580
> URL: https://issues.apache.org/jira/browse/LUCENE-9580
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5, 8.6
>Reporter: Iurii Vyshnevskyi
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This bug was discovered while using ElasticSearch (checked with versions 
> 7.6.2 and 7.9.2).
> But I've created an isolated test case just for Lucene: 
> [https://github.com/apache/lucene-solr/pull/2006/files]
>  
> The unit test fails with "java.lang.IllegalArgumentException: Unable to 
> Tessellate shape".
>  
> The polygon contains two holes that share the same vertex and one more 
> standalone hole.
> Removing any of them makes the unit test pass. 
>  
> Changing the least significant digit in any coordinate of the "common vertex" 
> in either of the first two holes, so that these vertices become different in 
> each hole, also makes the unit test pass.






[GitHub] [lucene] jtibshirani opened a new pull request, #986: Fix FieldExistsQuery rewrite when all docs have vectors

2022-06-27 Thread GitBox


jtibshirani opened a new pull request, #986:
URL: https://github.com/apache/lucene/pull/986

   Before we were checking the number of vectors in the segment against the
   total number of documents in the IndexReader. This meant FieldExistsQuery
   would not rewrite to MatchAllDocsQuery when there were multiple segments.
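   A hedged sketch of the corrected condition (illustrative, not the literal patch): the check has to be made per segment, against each leaf's own doc count.

   ```java
   import java.io.IOException;
   import org.apache.lucene.index.*;

   final class AllDocsHaveVectors {
     // Sketch: rewrite to MatchAllDocsQuery only if *every* segment is fully covered.
     static boolean check(IndexReader reader, String field) throws IOException {
       for (LeafReaderContext ctx : reader.leaves()) {
         LeafReader leaf = ctx.reader();
         VectorValues values = leaf.getVectorValues(field);
         // Compare against the leaf's maxDoc, not the top-level reader's doc count.
         if (values == null || values.size() != leaf.maxDoc()) {
           return false;
         }
       }
       return true;
     }
   }
   ```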





[GitHub] [lucene] jtibshirani merged pull request #986: Fix FieldExistsQuery rewrite when all docs have vectors

2022-06-27 Thread GitBox


jtibshirani merged PR #986:
URL: https://github.com/apache/lucene/pull/986





[GitHub] [lucene] jpountz commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-27 Thread GitBox


jpountz commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r907105526


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
 
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
   > I don't see a good way to do this, since liveDocs is not backed by a 
FixedBitSet
   
   FWIW there is no guarantee that liveDocs are backed by a FixedBitSet, but 
the default codec always uses a FixedBitSet.
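   In code, the caveat looks like this (an illustrative check, not an API guarantee):

   ```java
   import org.apache.lucene.index.LeafReader;
   import org.apache.lucene.util.Bits;
   import org.apache.lucene.util.FixedBitSet;

   final class LiveDocsUtil {
     // Returns the live docs as a FixedBitSet when possible, else null.
     // The default codec happens to use FixedBitSet, but any Bits is allowed.
     static FixedBitSet asFixedBitSet(LeafReader reader) {
       Bits liveDocs = reader.getLiveDocs(); // null means no deletions
       if (liveDocs instanceof FixedBitSet) {
         return (FixedBitSet) liveDocs;
       }
       return null; // caller must keep a fallback path for generic Bits
     }
   }
   ```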






[GitHub] [lucene] jpountz commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-27 Thread GitBox


jpountz commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1167025504

   > I feel the effect would be similar?
   
   Indeed, sorry I had misread your code!
   
   > In terms of next steps, I'm wondering if there's a preference between bulk 
scorer and scorer implementations when performance improvement is similar
   
   No, it shouldn't matter. Bulk scorers sometimes help yield better 
performance because it's easier for them to amortize computation across docs, 
but if they don't yield better performance, there's no point in using a bulk 
scorer instead of a regular scorer.
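   For illustration, a minimal bulk scorer that merely wraps a Scorer (a sketch, similar in spirit to Lucene's default wrapper; it amortizes nothing by itself, which is exactly why it gains nothing over a regular scorer):
   
   ```java
   import java.io.IOException;
   import org.apache.lucene.search.*;
   import org.apache.lucene.util.Bits;

   final class SimpleBulkScorer extends BulkScorer {
     private final Scorer scorer;

     SimpleBulkScorer(Scorer scorer) {
       this.scorer = scorer;
     }

     @Override
     public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
       collector.setScorer(scorer);
       DocIdSetIterator it = scorer.iterator();
       int doc = it.docID();
       if (doc < min) {
         doc = it.advance(min);
       }
       // Doc-by-doc loop: a specialized implementation would instead batch
       // work across the whole [min, max) window to amortize per-doc costs.
       for (; doc < max; doc = it.nextDoc()) {
         if (acceptDocs == null || acceptDocs.get(doc)) {
           collector.collect(doc);
         }
       }
       return doc; // first doc beyond the scored window
     }

     @Override
     public long cost() {
       return scorer.iterator().cost();
     }
   }
   ```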
   
   I agree that it looks like a great speedup, we should get this in! The 
benchmark only tests performance of top-level disjunctions of term queries that 
have two clauses. I'd be curious to get performance numbers for queries like 
the below ones to see if we need to fine-tune a bit more when this new scorer 
gets used. Note that I don't think we need to get the performance better for 
all these queries to merge the change, we could start by only using this new 
scorer for the (common) case of a top-level disjunction of 2 term queries, and 
later see if this scorer can handle more disjunctions.
   
   ```
   OrAndHigMedAndHighMed: (+including +looking) (+date +finished) # disjunction of conjunctions, which don't have as good score upper bounds as term queries
   OrHighPhraseHighPhrase: "united states" "new york" # disjunction of phrase queries, which don't have as good score upper bounds as term queries and are slow to advance
   AndHighOrMedMed: +be +(mostly interview) # disjunction within conjunction that leads iteration
   AndMedOrHighHigh: +interview +(at united) # disjunction within conjunction that doesn't lead iteration
   ```





[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907109652


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -382,23 +386,20 @@ public int advance(int target) {
     public boolean advanceExact(int target) throws IOException {
       // needed in IndexSorter#StringSorter
       docID = target;
+      initCount();
       ordUpto = ords.offsets[docID] - 1;
       return ords.offsets[docID] > 0;
     }
 
     @Override
     public long nextOrd() {
-      long ord = ords.ords.get(ordUpto++);
-      if (ord == 0) {
-        return NO_MORE_ORDS;
-      } else {
-        return ord - 1;
-      }
+      return ords.ords.get(ordUpto++);

Review Comment:
   We should keep returning NO_MORE_ORDS when ords are exhausted.
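   Roughly what the review asks for (a sketch; `limit` is an assumed field marking one past the doc's last ord):

   ```java
   @Override
   public long nextOrd() {
     if (ordUpto == limit) {
       return NO_MORE_ORDS; // keep the sentinel once this doc's ords are exhausted
     }
     return ords.ords.get(ordUpto++);
   }
   ```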



##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +416,43 @@ public BytesRef lookupOrd(long ord) throws IOException {
     public long getValueCount() {
       return in.getValueCount();
     }
+
+    private void initCount() {
+      assert docID >= 0;
+      count = (int) ords.growableWriter.get(docID);
+    }
   }
 
   static final class DocOrds {
     final long[] offsets;
     final PackedLongValues ords;
+    final GrowableWriter growableWriter;

Review Comment:
   Let's call it `docValueCounts` or something similar that better reflects 
what it stores?






[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


LuXugang commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907126551



Review Comment:
   @jpountz, I have already used the new ords iteration style for 
`SortingSortedSetDocValues` in 
https://github.com/apache/lucene/pull/967/commits/2de6d0c071bf3344f8f026f023df22953aab9ee3,
 so maybe NO_MORE_ORDS is no longer needed?






[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907164385



Review Comment:
   The problem is that custom DocValuesFormats would no longer work if they 
iterate over values using NO_MORE_ORDS. Maybe it's OK because users who write 
custom codecs are expert users, but I'd rather discuss it on a separate issue 
than do it silently as part of this bug fix?






[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


LuXugang commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907169606



Review Comment:
   Thanks for the explanation!






[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)

2022-06-27 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559095#comment-17559095
 ] 

Tomoko Uchida commented on LUCENE-10622:


There were three issues that could not be imported.

[LUCENE-1498]

{code}
[2022-06-26 18:38:25,394] ERROR:import_github_issues: Import GitHub issue 
/mnt/hdd/repo/sandbox-lucene-10557/migration/github-import-data/GH-LUCENE-1498.json
 was failed. status=failed, errors=[{'location': '/issue', 'resource': 'Issue', 
'field': None, 'value': None, 'code': 'error'}]
{code}

I have no idea about the cause. Maybe the body contains character sequences that 
are not acceptable to GitHub. There are only two comments, so it would be easy 
to port them manually.

[LUCENE-4344]

Looks like this is redirected to [SOLR-3769] - no action required.

[LUCENE-5612]

{code}
[2022-06-27 02:28:25,450] ERROR:github_issues_util: Failed to import issue 
LockStressTest fails always with NativeFSLockFactory [LUCENE-5612]; 
status_code=413, message={"message":"Payload too big: 1048576 bytes are 
allowed, 1468832 bytes were 
posted.","documentation_url":"https://docs.github.com/rest"}
{code}

The data size exceeds the API's limit (1MB). I think the long stacktrace in a 
comment is the cause. Maybe we could trim the comments and manually port the 
trimmed comments afterward.

The other 10,608 issues were successfully imported in 20 hours (the first pass).
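A hypothetical helper for the migration script (names are illustrative, not part of any existing tool): checking an issue-import JSON against the 1MB request limit before POSTing would let oversized issues like LUCENE-5612 be flagged for trimming instead of failing with HTTP 413.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class PayloadCheck {
  // GitHub's import endpoint rejected 1,468,832 bytes: 1,048,576 are allowed.
  private static final long LIMIT_BYTES = 1_048_576;

  // The JSON is posted as-is, so its size on disk equals the request payload.
  static boolean fitsImportLimit(Path json) throws IOException {
    return Files.size(json) <= LIMIT_BYTES;
  }
}
{code}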


> Prepare complete migration script to GitHub issue from Jira (best effort)
> -
>
> Key: LUCENE-10622
> URL: https://issues.apache.org/jira/browse/LUCENE-10622
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> If we intend to move the history to GitHub, it should be perfect as far as 
> possible - significantly degraded copies of history are harmful, rather than 
> helpful for future contributors, I think.






[GitHub] [lucene] alessandrobenedetti commented on pull request #926: VectorSimilarityFunction reverse removal

2022-06-27 Thread GitBox


alessandrobenedetti commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1167161892

   My plan is to merge tomorrow morning UK time. If you have any additional 
concerns let me know!





[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


LuXugang commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907325756



Review Comment:
   Addressed in 
https://github.com/apache/lucene/pull/967/commits/48a1ec2c9e263769c60e24582e69c9cd8d00e382






[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


LuXugang commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907329493



Review Comment:
   @jpountz I reverted part of the code to the old ord iteration style so that 
the changes in 
https://github.com/apache/lucene/pull/967/commits/8ccc59812e12386dd684e5a0b85a78a0495fcb11
 can be verified, and we can focus on fixing the bug first.






[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907352353


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +419,45 @@ public BytesRef lookupOrd(long ord) throws IOException {
     public long getValueCount() {
       return in.getValueCount();
     }
+
+    private void initCount() {
+      assert docID >= 0;
+      ordUpto = ords.offsets[docID] - 1;
+      count = (int) ords.docValueCounts.get(docID);
+      limit = ordUpto + count;
+    }
   }
 
   static final class DocOrds {
     final long[] offsets;
     final PackedLongValues ords;
+    final GrowableWriter docValueCounts;
+
+    public static final int START_BITS_PER_VALUE = 2;
 
     DocOrds(
         int maxDoc,
         Sorter.DocMap sortMap,
         SortedSetDocValues oldValues,
-        float acceptableOverheadRatio)
+        float acceptableOverheadRatio,
+        int bitsPerValue)
         throws IOException {
       offsets = new long[maxDoc];
       PackedLongValues.Builder builder = PackedLongValues.packedBuilder(acceptableOverheadRatio);
-      long ordOffset = 1; // 0 marks docs with no values
+      docValueCounts = new GrowableWriter(bitsPerValue, maxDoc, acceptableOverheadRatio);
+      long ordOffset = 1;

Review Comment:
   Let's start at 0 to not have to subtract 1 all the time in `initCount()`?
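   i.e., something like this sketch (assuming offsets would then point directly at the first ord):

   ```java
   private void initCount() {
     assert docID >= 0;
     ordUpto = ords.offsets[docID]; // no "- 1" once ordOffset starts at 0
     count = (int) ords.docValueCounts.get(docID);
     limit = ordUpto + count;
     // emptiness can be detected via count == 0 instead of a sentinel offset
   }
   ```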






[jira] [Created] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-27 Thread LuYunCheng (Jira)
LuYunCheng created LUCENE-10627:
---

 Summary: Using CompositeByteBuf to Reduce Memory Copy
 Key: LUCENE-10627
 URL: https://issues.apache.org/jira/browse/LUCENE-10627
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs, core/store
Reporter: LuYunCheng


 

I see that when Lucene does flush and merge of stored fields, it needs many memory copies:

{code:java}
"Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable [0x7f17718db000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
{code}

When Lucene *CompressingStoredFieldsWriter* flushes documents, it needs many memory copies:

With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copies blocks into one contiguous buffer for chunk compression
 # the compressor copies dict and data into one block buffer
 # do compress
 # copy compressed data out

With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copies blocks into one contiguous buffer for chunk compression
 # do compress
 # copy compressed data out

I think we can use CompositeByteBuf to reduce temp memory copies:
 # we do not have to *bufferedDocs.toArrayCopy* when we just need contiguous content for chunk compression (see the sketch below)

 

I wrote a simple mini benchmark in test code:
*LZ4WithPresetDict run* Capacity:41943040(bytes), iter 10 times: Origin elapse: 5391ms, New elapse: 5297ms
*DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10 times: Origin elapse: 115ms, New elapse: 12ms
 
And I ran runStoredFieldsBenchmark with doc_limit=-1, which shows:
||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
|Baseline|318877.00|606288.00|
|Candidate|314442.00|604719.00|






[GitHub] [lucene] luyuncheng opened a new pull request, #987: Using CompositeByteBuf to Reduce Memory Copy

2022-06-27 Thread GitBox


luyuncheng opened a new pull request, #987:
URL: https://github.com/apache/lucene/pull/987

   JIRA: https://issues.apache.org/jira/browse/LUCENE-10627
   I see that when Lucene does flush and merge of stored fields, it needs many memory copies:
   ```
   "Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable [0x7f17718db000]
      java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
        at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
   ```
   
   When Lucene CompressingStoredFieldsWriter flushes documents, it needs many memory copies:
   
   - With Lucene90 using LZ4WithPresetDictCompressionMode:
   
   1. bufferedDocs.toArrayCopy copies blocks into one contiguous buffer for chunk compression
   2. the compressor copies dict and data into one block buffer
   3. do compress
   4. copy compressed data out
   
   - With Lucene90 using DeflateWithPresetDictCompressionMode:
   
   1. bufferedDocs.toArrayCopy copies blocks into one contiguous buffer for chunk compression
   2. do compress
   3. copy compressed data out
   
   I think we can use `CompositeByteBuf` to **reduce temp memory copies**:
   - we do not have to bufferedDocs.toArrayCopy when we just need contiguous content for chunk compression
   
   
   I wrote a simple mini benchmark in test code:
   LZ4WithPresetDict run Capacity:41943040(bytes), iter 10 times: 
   `Origin elapse:5391ms, New elapse:5297ms`
   DeflateWithPresetDict run Capacity:41943040(bytes), iter 10 times: 
   `Origin elapse:115ms, New elapse:12ms`
    
   And I ran runStoredFieldsBenchmark with doc_limit=-1, which shows:
   
   
   Msec to index | BEST_SPEED | BEST_COMPRESSION
   -- | -- | --
   Baseline | 318877.00 | 606288.00
   Candidate | 314442.00 | 604719.00
   
   





[jira] [Updated] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-27 Thread LuYunCheng (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuYunCheng updated LUCENE-10627:

Description: 
Code: https://github.com/apache/lucene/pull/987


[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559161#comment-17559161
 ] 

Tomoko Uchida commented on LUCENE-10557:


I opened an INFRA issue https://issues.apache.org/jira/browse/INFRA-23421

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority of) Apache projects already use GitHub issues instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I have 
> little knowledge of how to proceed with it, so I'd like to discuss whether we 
> should migrate, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)






[jira] [Updated] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-27 Thread LuYunCheng (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuYunCheng updated LUCENE-10627:

Description: 
Code: [https://github.com/apache/lucene/pull/987]


[GitHub] [lucene] mayya-sharipova commented on pull request #926: VectorSimilarityFunction reverse removal

2022-06-27 Thread GitBox


mayya-sharipova commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1167351533

   @alessandrobenedetti Thanks for running the tests; the results look good to 
me. 
   I was also wondering whether you have addressed Mike S.'s previous 
[comment](https://github.com/apache/lucene/pull/926#issuecomment-1164418508). I 
assume that your train files (e.g. `sift-128-euclidean.hdf5-test`) are not in 
hdf5 format, but just have it in their names. 





[GitHub] [lucene] mocobeta opened a new pull request, #988: LUCENE-10557: temporarily enable github issue

2022-06-27 Thread GitBox


mocobeta opened a new pull request, #988:
URL: https://github.com/apache/lucene/pull/988

   This temporarily enables GitHub issues for testing (LUCENE-10557).
   
https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Repositoryfeatures
   
   After checking that it works, I'll re-disable the feature until the actual migration.
   
   
   





[jira] [Commented] (LUCENE-10571) Monitor alternative "TermFilter" Presearcher for sparse filter fields

2022-06-27 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559248#comment-17559248
 ] 

Chris M. Hostetter commented on LUCENE-10571:
-

/ping [~romseygeek] ... curious if you have any thoughts on this?

> Monitor alternative "TermFilter" Presearcher for sparse filter fields
> -
>
> Key: LUCENE-10571
> URL: https://issues.apache.org/jira/browse/LUCENE-10571
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/monitor
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10571.patch
>
>
> One of the things that surprised me the most when looking into how the 
> {{TermFilteredPresearcher}} worked was what happens when Queries and/or 
> Documents do _NOT_  have a value in a configured filter field.
> per the javadocs...
> {quote}Filtering by additional fields can be configured by passing a set of 
> field names. Documents that contain values in those fields will only be 
> checked against \{@link MonitorQuery} instances that have the same 
> fieldname-value mapping in their metadata.
> {quote}
> ...which is straightforward and useful in the tested example where every 
> registered Query has {{"language"}} metadata, and every Document has a 
> {{"language"}} field, but gives unintuitive results when a Query or Document 
> does *NOT* have a {{"language"}}.
> A more "intuitive" & useful (in my opinion) implementation would be 
> something that could be documented as ...
> {quote}Filtering by additional fields can be configured by passing a set of 
> field names. Documents that contain values in those fields will only be 
> checked against \{@link MonitorQuery} instances
> that have the same fieldname-value mapping in their metadata or have no 
> mapping for that fieldname.
> Documents that do not contain values in those fields will only be checked 
> against \{@link MonitorQuery} instances that also have no mapping for that 
> fieldname.
> {quote}
> ...ie: instead of a straight "filter candidate queries by what we find 
> in the filter fields in the documents", we can instead "derive the queries 
> that are viable candidates for each document as if we were restricting the 
> set of documents by those values during a forward search".
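The proposed matching rule, expressed as a standalone sketch (not the Monitor API; names are illustrative):

{code:java}
import java.util.Map;
import java.util.Set;

final class ProposedFilterRule {
  // A query is a candidate for a document iff, for every filter field, the
  // query either has no metadata mapping for that field, or its mapping
  // equals the document's value for that field.
  static boolean isCandidate(
      Map<String, String> queryMetadata, Map<String, String> docValues, Set<String> filterFields) {
    for (String field : filterFields) {
      String queryValue = queryMetadata.get(field);
      if (queryValue == null) {
        continue; // no mapping: the query stays a candidate for this field
      }
      if (!queryValue.equals(docValues.get(field))) {
        return false; // mapping present: it must match the document's value
      }
    }
    return true;
  }
}
{code}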






[GitHub] [lucene] uschindler commented on pull request #978: Remove/deprecate obsolete constants in oal.util.Constants; remove code which is no longer executed after Java 9

2022-06-27 Thread GitBox


uschindler commented on PR #978:
URL: https://github.com/apache/lucene/pull/978#issuecomment-1167583777

   I will merge this later this evening unless somebody complains :-)





[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559276#comment-17559276
 ] 

Michael McCandless commented on LUCENE-10557:
-

{quote}  Jira markup is converted into Markdown for rendering.
 * There are many conversion errors and need close investigation.{quote}
This seems perhaps solvable, relatively quickly – the conversion tool is 
open source, right?  Tables seem flaky ... what other markup?  I can try to dive 
deep on this if I can make some time.  Let's not rush this conversion.
{quote}"attachments" (patches, images, etc) cannot be migrated with basic 
GitHub API functionality.
 * There could be workarounds; e.g. save them in another github repo and 
rewrite attachment links to refer to them.{quote}
I thought the "unofficial" migration API might support attachments?  Or are 
there big problems with using that API?
{quote}As a reference I will migrate existing all issues into a test repository 
in shortly. Hope we can make a decision by looking at it - I mean, I'll be not 
able to further invest my time in this PoC.

I'll post the PoC migration result to the dev list to ask if we should proceed 
with it or not next week.
{quote}
+1!  Thank you for pushing so hard on this [~tomoko]!  Let's not rush the 
decision ... others can try to push your PoC forwards too to improve the 
migration quality.  This is worth the one-time investment.  And hey, maybe we 
enable something that future Jira -> GitHub issues migrations can use.

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A few (not the majority of) Apache projects already use GitHub issues instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I have 
> little knowledge of how to proceed with it, so I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559285#comment-17559285
 ] 

Uwe Schindler commented on LUCENE-10557:


Once we have done this: Should we rewrite CHANGES.txt and replace all 
LUCENE- links to GITHUB# links?

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A few (not the majority of) Apache projects already use GitHub issues instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I have 
> little knowledge of how to proceed with it, so I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Dawid Weiss (Jira)
Dawid Weiss commented on LUCENE-10557

Re: Migrate to GitHub issue from Jira

I toyed with attachments a bit.

* I've modified Tomoko's code a bit so that it fetches attachments for each issue and places them under attachments/LUCENE-xyz/blob.ext.
* I fetched about half of the attachments from Jira and they total ~350MB. So they're quite large but not unbearably large.
* I created a separate test repository (https://github.com/dweiss/lucene-jira-migration), with a subset of attachment blobs and an example issue (https://github.com/dweiss/lucene-jira-migration/issues/1) that links to them via gh-pages service URLs. Seems to work (mime types, etc.).
* The test repository has an orphaned (separate root) branch for just the attachment blobs but they're still downloaded when you clone the master branch (which I kind of hoped could be avoided). This means that we'd have to either ask infra to create a separate repository for the ported attachments or keep those attachments in the main Lucene repository (and pay the price of an extra ~1GB of download size when doing a full clone).
* I didn't check for multiple attachments with the same name (perhaps it's uncommon but definitely possible) - these would have to be saved under a subfolder or something, so that they can be distinguished.
* A mapping of original attachment URLs and new attachment URLs could also be preserved/ written.
* Since the attachments are a git repository, they should be searchable but for some reason it didn't work for me (maybe needs time to update the index).


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)

[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Dawid Weiss (Jira)
Dawid Weiss edited a comment on LUCENE-10557

Re: Migrate to GitHub issue from Jira

I toyed with attachments a bit.

* I've modified Tomoko's code a bit so that it fetches attachments for each issue and places them under {{attachments/LUCENE-xyz/blob.ext}}.
* I fetched about half of the attachments from Jira and they total ~350MB. So they're quite large but not unbearably large.
* I created a separate test repository ([https://github.com/dweiss/lucene-jira-migration]), with a subset of attachment blobs and an example issue ([https://github.com/dweiss/lucene-jira-migration/issues/1]) that links to them via gh-pages service URLs. Seems to work (mime types, etc.).
* The test repository has an orphaned (separate root) branch for just the attachment blobs but they're still downloaded when you clone the master branch (which I kind of hoped could be avoided). This means that we'd have to either ask infra to create a separate repository for the ported attachments or keep those attachments in the main Lucene repository (and pay the price of an extra ~1GB of download size when doing a full clone).
* I didn't check for multiple attachments with the same name (perhaps it's uncommon but definitely possible) - these would have to be saved under a subfolder or something, so that they can be distinguished.
* A mapping of original attachment URLs and new attachment URLs could also be preserved/ written.
* Since the attachments are a git repository, they should be searchable but for some reason it didn't work for me (maybe needs time to update the index).

This is just an experiment, I don't mean to imply it has to be done (or should). I was just curious as to what's possible.


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)


[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Tomoko Uchida (Jira)
Tomoko Uchida commented on LUCENE-10557

Re: Migrate to GitHub issue from Jira

Michael McCandless:
{quote}This seems perhaps solvable, relatively quickly – the conversion tool is open-source right? Tables seem flaky ... what other markup? I can try to dive deep on this if I can make some time. Let's not rush this conversion.{quote}
Besides tables, even simple bullet lists are broken. I haven't closely looked at it yet, but I suspect there may be problems in the source text (Jira dump). It could be easily fixed once we find the root cause.

{quote}I thought the "unofficial" migration API might support attachments? Or are there big problems with using that API?{quote}
The unofficial import API does not support binaries, you can only import texts to GitHub with official or unofficial APIs. They have to be stored in other places, outside the main repository (a file storage or another repository).

{quote}Let's not rush the decision ... others can try to push your PoC forwards too to improve the migration quality. This is worth the one-time investment. And hey, maybe we enable something that future Jira -> GitHub issues migrations can use.{quote}
I understand we can't push others to make a decision though; a progress report could be useful since I think we have not reached any conclusion yet. As for "others can try to push your PoC forwards too to improve the migration quality", yes it could happen, but to be honest I don't expect there are other people who want to be involved in this task.

Uwe Schindler:
{quote}Once we have done this: Should we rewrite CHANGES.txt and replace all LUCENE- links to GITHUB# links?{quote}
I'm not sure if it should be done. Just for your information, the current changes2html.pl supports only Pull Requests, so it should be changed if we want to mention GitHub issues in CHANGES.

Dawid Weiss:
{quote}I created a separate test repository (https://github.com/dweiss/lucene-jira-migration), with a subset of attachment blobs and an example issue (https://github.com/dweiss/lucene-jira-migration/issues/1) that links to them via gh-pages service URLs. Seems to work (mime types, etc.).{quote}
Do we need a git repository at all? We won't need version control for the files. Is a file storage sufficient and easy to handle if we can have one?

{quote}This means that we'd have to either ask infra to create a separate repository for the ported attachments or keep those attachments in the main Lucene repository (and pay the price of an extra ~1GB of download size when doing a full clone).{quote}
This is actually the main concern to me. Unfortunately I don't think I'll be able to explain our needs and request support from the infra team. I'm sure I won't be able to be a good negotiator for this even if I want to. We need another person if we want to pursue pulling up all attachments from jira.

[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Tomoko Uchida (Jira)
Tomoko Uchida edited a comment on LUCENE-10557

Re: Migrate to GitHub issue from Jira

[~mikemccand]
bq. This seems perhaps solvable, relatively quickly – the conversion tool is open-source right? Tables seem flaky ... what other markup? I can try to dive deep on this if I can make some time. Let's not rush this conversion.

Besides tables, even simple bullet lists are broken. I haven't closely looked at it yet, but I suspect there may be problems in the source text (Jira dump). It could be easily fixed once we find the root cause.

bq. I thought the "unofficial" migration API might support attachments? Or are there big problems with using that API?

The unofficial import API does not support binaries, you can import only "texts" to GitHub with official or unofficial APIs. They have to be stored in other places, maybe outside the main repository (a file storage or another repository).

bq. Let's not rush the decision ... others can try to push your PoC forwards too to improve the migration quality. This is worth the one-time investment. And hey, maybe we enable something that future Jira -> GitHub issues migrations can use.

I understand we can't push others to make a decision though, a progress report could be useful since I think we have not reached any conclusion yet. As for "others can try to push your PoC forwards too to improve the migration quality", yes it could happen but to be honest I don't expect there are other people who want to be involved in this task.

[~uschindler]
bq. Once we have done this: Should we rewrite CHANGES.txt and replace all LUCENE- links to GITHUB# links?

I'm not sure if it should be done. Just for your information the current {{changes2html.pl}} supports only Pull Requests, so the script should be changed if we want to mention GitHub issues in CHANGES. (I have little experience with perl, but I'll take a look if it's needed. Maybe we should also support issues in the near future.)

[~dweiss]
bq. I created a separate test repository (https://github.com/dweiss/lucene-jira-migration), with a subset of attachment blobs and an example issue (https://github.com/dweiss/lucene-jira-migration/issues/1) that links to them via gh-pages service URLs. Seems to work (mime types, etc.).

Do we need a git repository at all? We won't need version control for the files. Is a file storage sufficient and easy to handle if we can have one?

bq. This means that we'd have to either ask infra to create a separate repository for the ported attachments or keep those attachments in the main Lucene repository (and pay the price of an extra ~1GB of download size when doing a full clone).

This is actually the main concern to me. Unfortunately I don't think I'll be able to explain our needs and request support from the infra team. I'm sure I won't be able to be a good negotiator for this even if I want to. We need another person if we want to pursue pulling up all attachments from jira.

[GitHub] [lucene] madrob opened a new pull request, #989: Add back-compat indices for 8.11.2

2022-06-27 Thread GitBox


madrob opened a new pull request, #989:
URL: https://github.com/apache/lucene/pull/989

   Regenerated the index manually, not using the wizard. Spent a lot of time 
trying to isolate the failures, but couldn't figure them out. New index seems 
to work but I would appreciate other folks testing it.
   
   Generated from a download of `lucene-8.11.2-src.tgz` with ant 1.9 and java 8.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Tomoko Uchida (Jira)
Tomoko Uchida edited a comment on LUCENE-10557

Re: Migrate to GitHub issue from Jira

[~mikemccand]
bq. This seems perhaps solvable, relatively quickly – the conversion tool is open-source right? Tables seem flaky ... what other markup? I can try to dive deep on this if I can make some time. Let's not rush this conversion.

Besides tables, even simple bullet lists are broken. I haven't closely looked at it yet, but I suspect there may be problems in the source text (Jira dump). It could be easily fixed once we find the root cause. The script uses this converter https://github.com/catcombo/jira2markdown for the PoC; if the cause of the broken markdowns is the tool's bug, there could be other tools, or of course we could write our own parser/converter from the jira markup spec:
https://jira.atlassian.com/secure/WikiRendererHelpAction.jspa?section=all

bq. I thought the "unofficial" migration API might support attachments? Or are there big problems with using that API?

The unofficial import API does not support binaries, you can import only "texts" to GitHub with official or unofficial APIs. They have to be stored in other places, maybe outside the main repository (a file storage or another repository).

bq. Let's not rush the decision ... others can try to push your PoC forwards too to improve the migration quality. This is worth the one-time investment. And hey, maybe we enable something that future Jira -> GitHub issues migrations can use.

I understand we can't push others to make a decision though, a progress report could be useful since I think we have not reached any conclusion yet. As for "others can try to push your PoC forwards too to improve the migration quality", yes it could happen but to be honest I don't expect there are other people who want to be involved in this task.

[~uschindler]
bq. Once we have done this: Should we rewrite CHANGES.txt and replace all LUCENE- links to GITHUB# links?

I'm not sure if it should be done. Just for your information the current {{changes2html.pl}} supports only Pull Requests, so the script should be changed if we want to mention GitHub issues in CHANGES. (I have little experience with perl, but I'll take a look if it's needed. Maybe we should also support issues in the near future.)

[~dweiss]
bq. I created a separate test repository (https://github.com/dweiss/lucene-jira-migration), with a subset of attachment blobs and an example issue (https://github.com/dweiss/lucene-jira-migration/issues/1) that links to them via gh-pages service URLs. Seems to work (mime types, etc.).

Do we need a git repository at all? We won't need version control for the files. Is a file storage sufficient and easy to handle if we can have one?

bq. This means that we'd have to either ask infra to create a separate repository for the ported attachments or keep those attachments in the main Lucene repository (and pay the price of an extra ~1GB of download size when doing a full clone).

This is actually the main concern to me. Unfortunately I don't think I'll be able to explain our needs and request support from the infra team. I'm sure I won't be able to be a good negotiator for this even if I want to. We need another person if we want to pursue pulling up all attachments from jira.

[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts

2022-06-27 Thread GitBox


gsmiller commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r907916354


##
lucene/facet/src/java/org/apache/lucene/facet/range/RangeFacetCounts.java:
##
@@ -232,20 +233,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
     return new FacetResult(dim, path, totCount, labelValues, labelValues.length);
   }
 
-  // The current getTopChildren method is not returning "top" ranges. Instead, it returns all
-  // user-provided ranges in
-  // the order the user specified them when instantiating. This concept is being introduced and
-  // supported in the
-  // getAllChildren functionality in LUCENE-10550. getTopChildren is temporarily calling
-  // getAllChildren to maintain its
-  // current behavior, and the current implementation will be replaced by an actual "top children"
-  // implementation
-  // in LUCENE-10614
-  // TODO: fix getTopChildren in LUCENE-10614
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+    validateDimAndPathForGetChildren(dim, path);
+
+    int resultSize = Math.min(topN, counts.length);
+    PriorityQueue<LabelAndValue> pq =
+        new PriorityQueue<>(resultSize) {
+          @Override
+          protected boolean lessThan(LabelAndValue a, LabelAndValue b) {
+            int cmp = Integer.compare(a.value.intValue(), b.value.intValue());
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    for (int i = 0; i < counts.length; i++) {
+      if (pq.size() < resultSize) {
+        pq.add(new LabelAndValue(ranges[i].label, counts[i]));

Review Comment:
   I wonder if we should only add to the pq when the count is > 0 to be 
consistent with other Facet implementations. What do you think?
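
   A minimal sketch of that tweak, reusing the `pq` from the diff above together 
with Lucene's `PriorityQueue.insertWithOverflow` (illustrative only, not the 
final patch):

   ```
   for (int i = 0; i < counts.length; i++) {
     if (counts[i] > 0) {
       // only ranges that matched at least one doc compete for the top-N,
       // consistent with other Facets implementations
       pq.insertWithOverflow(new LabelAndValue(ranges[i].label, counts[i]));
     }
   }
   ```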



##
lucene/demo/src/java/org/apache/lucene/demo/facet/DistanceFacetsExample.java:
##
@@ -212,7 +212,26 @@ public static Query getBoundingBoxQuery(
   }
 
   /** User runs a query and counts facets. */
-  public FacetResult search() throws IOException {
+  public FacetResult searchAllChildren() throws IOException {
+
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    Facets facets =
+        new DoubleRangeFacetCounts(
+            "field",
+            getDistanceValueSource(),
+            fc,
+            getBoundingBoxQuery(ORIGIN_LATITUDE, ORIGIN_LONGITUDE, 10.0),
+            ONE_KM,
+            TWO_KM,
+            FIVE_KM,
+            TEN_KM);
+
+    return facets.getAllChildren("field");
+  }
+
+  /** User runs a query and counts facets. */
+  public FacetResult searchTopChildren() throws IOException {

Review Comment:
   I'm not totally sold we need to demo the `getTopChildren` functionality. It 
feels like it will be a little obscure for range faceting to me. What do you 
think of just changing the existing example code in-place to use 
`getAllChildren` instead of `getTopChildren` since that's probably the more 
common use-case? Curious what you think though. Do you think we should demo 
`getTopChildren` as well?



##
lucene/facet/src/java/org/apache/lucene/facet/range/RangeFacetCounts.java:
##
@@ -232,20 +233,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
     return new FacetResult(dim, path, totCount, labelValues, labelValues.length);
   }
 
-  // The current getTopChildren method is not returning "top" ranges. Instead, it returns all
-  // user-provided ranges in
-  // the order the user specified them when instantiating. This concept is being introduced and
-  // supported in the
-  // getAllChildren functionality in LUCENE-10550. getTopChildren is temporarily calling
-  // getAllChildren to maintain its
-  // current behavior, and the current implementation will be replaced by an actual "top children"
-  // implementation
-  // in LUCENE-10614
-  // TODO: fix getTopChildren in LUCENE-10614
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+    validateDimAndPathForGetChildren(dim, path);
+
+    int resultSize = Math.min(topN, counts.length);
+    PriorityQueue<LabelAndValue> pq =
+        new PriorityQueue<>(resultSize) {
+          @Override
+          protected boolean lessThan(LabelAndValue a, LabelAndValue b) {
+            int cmp = Integer.compare(a.value.intValue(), b.value.intValue());
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    for (int i = 0; i < counts.length; i++) {
+      if (pq.size() < resultSize) {
+        pq.add(new LabelAndValue(ranges[i].label, counts[i]));
+      } else {
+        int topValue = pq.top().value.intValue();

[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Tomoko Uchida (Jira)
Tomoko Uchida commented on LUCENE-10557

Re: Migrate to GitHub issue from Jira

As for the attachments, just a rough idea... perhaps we could have these files in our personal space under "https://home.apache.org/~user"? I have never used this space, so "https://home.apache.org/~tomoko" is still empty. I don't know what maximum storage size is allowed per user; if it is too small to store the whole data, we could distribute the files across multiple accounts.


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)


[GitHub] [lucene] LuXugang merged pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


LuXugang merged PR #967:
URL: https://github.com/apache/lucene/pull/967


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


LuXugang commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r908003266


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +419,45 @@ public BytesRef lookupOrd(long ord) throws IOException {
     public long getValueCount() {
       return in.getValueCount();
     }
+
+    private void initCount() {
+      assert docID >= 0;
+      ordUpto = ords.offsets[docID] - 1;
+      count = (int) ords.docValueCounts.get(docID);
+      limit = ordUpto + count;
+    }
   }
 
   static final class DocOrds {
     final long[] offsets;
     final PackedLongValues ords;
+    final GrowableWriter docValueCounts;
+
+    public static final int START_BITS_PER_VALUE = 2;
 
     DocOrds(
         int maxDoc,
         Sorter.DocMap sortMap,
         SortedSetDocValues oldValues,
-        float acceptableOverheadRatio)
+        float acceptableOverheadRatio,
+        int bitsPerValue)
         throws IOException {
       offsets = new long[maxDoc];
       PackedLongValues.Builder builder = PackedLongValues.packedBuilder(acceptableOverheadRatio);
-      long ordOffset = 1; // 0 marks docs with no values
+      docValueCounts = new GrowableWriter(bitsPerValue, maxDoc, acceptableOverheadRatio);
+      long ordOffset = 1;

Review Comment:
Thanks, I saw `SortingSortedNumericDocValues` has the same logic; maybe we 
could fix it in a separate issue.
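
   For context, a minimal sketch of how this encoding is meant to be read back; 
names follow the diff above, and the 1-based offset convention (0 marking docs 
with no values) is taken from the removed comment:

   ```
   // Illustrative only - mirrors initCount() in the diff above.
   long start = ords.offsets[docID] - 1;              // 0-based position of the doc's first ord
   int count = (int) ords.docValueCounts.get(docID);  // what docValueCount() should return
   long limit = start + count;                        // end of the doc's ord window, as in initCount()
   ```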



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10623) Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread ASF subversion and git services (Jira)
ASF subversion and git services commented on LUCENE-10623

Re: Error implementation of docValueCount for SortingSortedSetDocValues

Commit d8fb47b67480afe5fffca68f1565774ef6874d60 in lucene's branch refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d8fb47b6748 ]

LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues (#967)


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)



[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-27 Thread Lu Xugang (Jira)
Lu Xugang commented on LUCENE-10603

Re: Improve iteration of ords for SortedSetDocValues

Hi Greg Miller, LUCENE-10623 was resolved; we could continue to work on this issue if you have some free time.


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)




[jira] [Resolved] (LUCENE-10623) Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread Lu Xugang (Jira)
Lu Xugang resolved as Fixed

Lucene - Core / LUCENE-10623

Error implementation of docValueCount for SortingSortedSetDocValues

Change By: Lu Xugang
Resolution: Fixed
Status: Open -> Resolved


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)




[GitHub] [lucene] LuXugang merged pull request #990: Add entry

2022-06-27 Thread GitBox


LuXugang merged PR #990:
URL: https://github.com/apache/lucene/pull/990


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-27 Thread GitBox


zacharymorn commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1168197720

   > > I feel the effect would be similar?
   > 
   > Indeed, sorry I had misread your code!
   > 
   
   No worries, thanks again for the suggestion!
   
   > 
   > No, it shouldn't matter. Bulk scorers sometimes help yield better 
performance because it's easier for them to amortize computation across docs, 
but if they don't yield better performance, there's no point in using a bulk 
scorer instead of a regular scorer.
   
   Ok I see, makes sense.
   
   
   > I agree that it looks like a great speedup, we should get this in! The 
benchmark only tests performance of top-level disjunctions of term queries that 
have two clauses. I'd be curious to get performance numbers for queries like 
the below ones to see if we need to fine-tune a bit more when this new scorer 
gets used. Note that I don't think we need to get the performance better for 
all these queries to merge the change, we could start by only using this new 
scorer for the (common) case of a top-level disjunction of 2 term queries, and 
later see if this scorer can handle more disjunctions.
   > 
   > ```
   > OrAndHigMedAndHighMed: (+including +looking) (+date +finished) # 
disjunction of conjunctions, which don't have as good score upper bounds as 
term queries
   > OrHighPhraseHighPhrase: "united states" "new york" # disjunction of phrase 
queries, which don't have as good score upper bounds as term queries and are 
slow to advance
   > AndHighOrMedMed: +be +(mostly interview) # disjunction within conjunction 
that leads iteration
   > AndMedOrHighHigh: +interview +(at united) # disjunction within conjunction 
that doesn't lead iteration
   > ```
   
   Sounds good! I have run these queries through the benchmark and the results look somewhat consistent:
   
   ```
   Task                     QPS baseline  StdDev  QPS my_modified_version  StdDev             Pct diff  p-value
   OrHighPhraseHighPhrase          28.89   (8.7%)                  24.19   (4.7%)  -16.3% ( -27% -  -3%)  0.000
   AndHighOrMedMed                101.24   (6.6%)                 101.09   (3.0%)   -0.1% (  -9% -  10%)  0.927
   AndMedOrHighHigh                81.44   (6.3%)                  81.62   (3.7%)    0.2% (  -9% -  10%)  0.895
   OrAndHigMedAndHighMed          128.26   (7.0%)                 136.94   (3.7%)    6.8% (  -3% -  18%)  0.000
   PKLookup                       221.47  (11.7%)                 236.93   (9.1%)    7.0% ( -12% -  31%)  0.035
   ```
   ```
   Task                     QPS baseline  StdDev  QPS my_modified_version  StdDev             Pct diff  p-value
   OrHighPhraseHighPhrase          27.73   (9.1%)                  23.73   (4.6%)  -14.4% ( -25% -   0%)  0.000
   AndHighOrMedMed                 97.09  (13.1%)                  99.30   (4.3%)    2.3% ( -13% -  22%)  0.462
   AndMedOrHighHigh                75.87  (15.2%)                  80.04   (5.7%)    5.5% ( -13% -  31%)  0.128
   PKLookup                       219.70  (15.7%)                 238.75  (12.4%)    8.7% ( -16% -  43%)  0.053
   OrAndHigMedAndHighMed          121.83  (13.7%)                 134.79   (4.4%)   10.6% (  -6% -  33%)  0.001
   ```
   ```
   Task                     QPS baseline  StdDev  QPS my_modified_version  StdDev             Pct diff  p-value
   OrHighPhraseHighPhrase          27.42  (16.2%)                  23.99   (4.0%)  -12.5% ( -28% -   9%)  0.001
   AndHighOrMedMed                 96.61  (15.8%)                 100.09   (3.6%)    3.6% ( -13% -  27%)  0.321
   AndMedOrHighHigh                75.72  (16.8%)                  79.53   (4.9%)    5.0% ( -14% -  32%)  0.200
   OrAndHigMedAndHighMed          122.33  (16.9%)                 136.60   (4.5%)   11.7% (  -8% -  39%)  0.003
   PKLookup                       207.94  (21.6%)                 233.10  (16.5%)   12.1% ( -21% -  63%)  0.046
   ```
   
   Looks like we may need to restrict the scorer to only term queries, or 
improve it for phrase queries? 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-27 Thread GitBox


zacharymorn commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1168202563

   For `OrHighPhraseHighPhrase`, the JFR CPU sampling result looks similar, but 
with the modified version calling `advanceShallow` more often, suggesting the 
BMM implementation might be doing boundary adjustment more often? 
   
   Modified:
   ```
   PERCENT   CPU SAMPLES   STACK
   8.63%     1389          org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
   5.24%     843           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advanceShallow()
   3.18%     511           java.nio.DirectByteBuffer#get()
   2.79%     449           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
   2.72%     438           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#refillPositions()
   2.48%     399           jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
   2.19%     353           org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
   2.11%     339           org.apache.lucene.search.PhraseScorer$1#matches()
   2.06%     331           org.apache.lucene.codecs.lucene90.Lucene90ScoreSkipReader#skipTo()
   1.83%     294           org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
   1.63%     263           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#nextPosition()
   1.49%     240           org.apache.lucene.store.ByteBufferGuard#getByte()
   1.24%     200           org.apache.lucene.search.ExactPhraseMatcher#advancePosition()
   1.24%     200           org.apache.lucene.search.ConjunctionDISI#doNext()
   1.21%     194           java.util.zip.Inflater#inflateBytesBytes()
   1.18%     190           org.apache.lucene.search.ExactPhraseMatcher#nextMatch()
   1.13%     182           org.apache.lucene.store.DataInput#readVLong()
   1.12%     181           org.apache.lucene.search.ExactPhraseMatcher$1#advanceShallow()
   1.11%     178           org.apache.lucene.search.ImpactsDISI#advanceShallow()
   1.07%     172           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#skipPositions()
   0.89%     143           java.lang.Class#isArray()
   0.81%     131           org.apache.lucene.codecs.lucene90.ForUtil#expand8()
   0.75%     121           org.apache.lucene.search.TwoPhaseIterator$TwoPhaseIteratorAsDocIdSetIterator#doNext()
   0.74%     119           org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
   0.71%     115           org.apache.lucene.search.ConjunctionDISI#docID()
   0.71%     115           org.apache.lucene.codecs.lucene90.ForUtil#shiftLongs()
   0.70%     113           org.apache.lucene.search.PhraseScorer#docID()
   0.70%     112           org.apache.lucene.codecs.lucene90.PForUtil#decode()
   0.68%     110           org.apache.lucene.search.ExactPhraseMatcher#maxFreq()
   0.68%     109           org.apache.lucene.search.ImpactsDISI#docID()
   ```
   
   Baseline:
   ```
   PERCENT   CPU SAMPLES   STACK
   8.66%     1196          org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
   3.88%     536           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
   2.96%     409           java.nio.DirectByteBuffer#get()
   2.78%     384           org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
   2.50%     345           org.apache.lucene.codecs.lucene90.Lucene90ScoreSkipReader#skipTo()
   2.46%     340           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advanceShallow()
   1.73%     239           org.apache.lucene.search.PhraseScorer$1#matches()
   1.72%     237           org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
   1.48%     204           java.util.zip.Inflater#inflateBytesBytes()
   1.48%     204           org.apache.lucene.codecs.lucene90.ForUtil#expand8()
   1.23%     170           jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
   1.21%     167           org.apache.lucene.search.ConjunctionDISI#doNext()
   1.20%     166           org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
   1.19%     165           org.apache.lucene.store.ByteBufferGuard#getByte()
   1.12%     155           org.apache.lucene.codecs.lucene90.PForUtil#prefixSum32()
   1.07%     148           java.lang.Class#isArray()
   1.06%     147           org.apache.lucene.codecs.lucene90.PForUtil#expand32()
   0.98%     135           org.apache.lucene.codecs.lucene90.PForUtil#decode()
   0.96%     133           org.apache.lucene.search.ConjunctionDISI#docID()
   0.91%     125           org.apache.lucene.search.Exac

[GitHub] [lucene] Yuti-G commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts

2022-06-27 Thread GitBox


Yuti-G commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r908038451


##
lucene/demo/src/java/org/apache/lucene/demo/facet/DistanceFacetsExample.java:
##
@@ -212,7 +212,26 @@ public static Query getBoundingBoxQuery(
   }
 
   /** User runs a query and counts facets. */
-  public FacetResult search() throws IOException {
+  public FacetResult searchAllChildren() throws IOException {
+
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    Facets facets =
+        new DoubleRangeFacetCounts(
+            "field",
+            getDistanceValueSource(),
+            fc,
+            getBoundingBoxQuery(ORIGIN_LATITUDE, ORIGIN_LONGITUDE, 10.0),
+            ONE_KM,
+            TWO_KM,
+            FIVE_KM,
+            TEN_KM);
+
+    return facets.getAllChildren("field");
+  }
+
+  /** User runs a query and counts facets. */
+  public FacetResult searchTopChildren() throws IOException {

Review Comment:
   I do not have a strong opinion about this. getAllChildren does make more 
sense for range faceting. I will replace getTopChildren with getAllChildren 
in the original demo. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10623) Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread ASF subversion and git services (Jira)
ASF subversion and git services commented on LUCENE-10623

Re: Error implementation of docValueCount for SortingSortedSetDocValues

Commit fb261e6ff48e5a57d9dff7fd960e21ec2634294d in lucene's branch refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fb261e6ff48 ]

LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues (#967)


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)




[GitHub] [lucene] alessandrobenedetti commented on pull request #926: VectorSimilarityFunction reverse removal

2022-06-27 Thread GitBox


alessandrobenedetti commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1168280355

   > "I was also wondering if you have addressed the previous Mike S.'s 
https://github.com/apache/lucene/pull/926#issuecomment-1164418508. I assume 
that your train files (e.g.sift-128-euclidean.hdf5-test ) are not in hdf5 
format, but just called like this"
   
   yes @mayya-sharipova, the latest benchmarks reported used the 
pre-processing @msokolov suggested.
   That's just the name of the file that's automatically generated by that 
script :) 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts

2022-06-27 Thread GitBox


Yuti-G commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r908084991


##
lucene/facet/src/java/org/apache/lucene/facet/range/RangeFacetCounts.java:
##
@@ -232,20 +233,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
     return new FacetResult(dim, path, totCount, labelValues, labelValues.length);
   }
 
-  // The current getTopChildren method is not returning "top" ranges. Instead, it returns all
-  // user-provided ranges in
-  // the order the user specified them when instantiating. This concept is being introduced and
-  // supported in the
-  // getAllChildren functionality in LUCENE-10550. getTopChildren is temporarily calling
-  // getAllChildren to maintain its
-  // current behavior, and the current implementation will be replaced by an actual "top children"
-  // implementation
-  // in LUCENE-10614
-  // TODO: fix getTopChildren in LUCENE-10614
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+    validateDimAndPathForGetChildren(dim, path);
+
+    int resultSize = Math.min(topN, counts.length);
+    PriorityQueue<LabelAndValue> pq =
+        new PriorityQueue<>(resultSize) {
+          @Override
+          protected boolean lessThan(LabelAndValue a, LabelAndValue b) {
+            int cmp = Integer.compare(a.value.intValue(), b.value.intValue());
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    for (int i = 0; i < counts.length; i++) {
+      if (pq.size() < resultSize) {
+        pq.add(new LabelAndValue(ranges[i].label, counts[i]));

Review Comment:
   In this case, I propose we also change the `getAllChildren` functionality in 
RangeFacetCounts to populate LabelAndValue only when count is > 0 to be 
consistent with `getAllChildren` in other Facet implementations. Please let me 
know what you think. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts

2022-06-27 Thread GitBox


Yuti-G commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r908084991


##
lucene/facet/src/java/org/apache/lucene/facet/range/RangeFacetCounts.java:
##
@@ -232,20 +233,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
     return new FacetResult(dim, path, totCount, labelValues, labelValues.length);
   }
 
-  // The current getTopChildren method is not returning "top" ranges. Instead, it returns all
-  // user-provided ranges in
-  // the order the user specified them when instantiating. This concept is being introduced and
-  // supported in the
-  // getAllChildren functionality in LUCENE-10550. getTopChildren is temporarily calling
-  // getAllChildren to maintain its
-  // current behavior, and the current implementation will be replaced by an actual "top children"
-  // implementation
-  // in LUCENE-10614
-  // TODO: fix getTopChildren in LUCENE-10614
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+    validateDimAndPathForGetChildren(dim, path);
+
+    int resultSize = Math.min(topN, counts.length);
+    PriorityQueue<LabelAndValue> pq =
+        new PriorityQueue<>(resultSize) {
+          @Override
+          protected boolean lessThan(LabelAndValue a, LabelAndValue b) {
+            int cmp = Integer.compare(a.value.intValue(), b.value.intValue());
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    for (int i = 0; i < counts.length; i++) {
+      if (pq.size() < resultSize) {
+        pq.add(new LabelAndValue(ranges[i].label, counts[i]));

Review Comment:
   In this case, I propose we also change the `getAllChildren` functionality in 
RangeFacetCounts to populate LabelAndValue only when count is > 0, to be 
consistent with `getAllChildren` in other Facet implementations - if top-N 
covers all ranges, getAllChildren and getTopChildren should then return the 
same results. Please let me know what you think. Thanks!
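
   A minimal sketch of what that could look like when building the 
`getAllChildren` result, assuming the existing `ranges`/`counts` arrays 
(java.util imports omitted; illustrative only, not the final patch):

   ```
   // Hypothetical: emit LabelAndValue entries only for ranges with a non-zero count.
   List<LabelAndValue> labelValues = new ArrayList<>();
   for (int i = 0; i < counts.length; i++) {
     if (counts[i] > 0) {
       labelValues.add(new LabelAndValue(ranges[i].label, counts[i]));
     }
   }
   ```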



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Dawid Weiss (Jira)
Dawid Weiss commented on LUCENE-10557

Re: Migrate to GitHub issue from Jira

> Do we need a git repository at all? We won't need version control for the files. Is a file storage sufficient and easy to handle if we can have one?

My hope was that these attachments could be stored in the primary git repository for convenience - keeping the historical artifacts together and having them served for free via github's infrastructure. It's also just convenient as it can be modified/ updated by multiple people (and those same people can freeze the repository for updates, once the migration is complete). Having those artifacts elsewhere (on home.apache.org) lacks some of these conveniences but it's fine too, of course.

Also, I don't think infra will have any problem in adding a repository called "lucene-archives" or something like this. I can ask if we decide to push in this direction.


This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)