[GitHub] [lucene] gautamworah96 opened a new pull request, #922: Index only the docs for FacetField posting list

2022-05-24 Thread GitBox


gautamworah96 opened a new pull request, #922:
URL: https://github.com/apache/lucene/pull/922

   ### Description (or a Jira issue link if you have one)
   
   Change the index option for FacetField to just index the DOCS and not the 
frequencies and offsets (we don't use these values). I still need to think a 
bit more about the long term implications of this change. Opening this PR as a 
starting point for now.
   
   ### Tests
   
   Existing tests pass. I looked at every place in the facet module where the 
code traverses a postings list, and none of them relies on the term 
frequency. This confused me at first: do we really have no use cases for 
`FacetField` that rely on the frequency? Apparently not. If one does come up, 
we could change the field type again.
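
As a rough illustration of the point above, here is a toy model (plain Java, not Lucene's postings API; `ToyPostings` and `Posting` are hypothetical names) of a traversal that only reads doc IDs. Under that assumption, the stored frequency has no effect on the result, which is why dropping it should be safe for such callers.

```java
final class ToyPostings {
    /** One posting: a doc ID plus a term frequency that the traversal below never reads. */
    static final class Posting {
        final int docId;
        final int freq;

        Posting(int docId, int freq) {
            this.docId = docId;
            this.freq = freq;
        }
    }

    /** Counts postings whose doc ID also appears in sortedHits; both inputs are sorted by doc ID. */
    static int countMatches(Posting[] postings, int[] sortedHits) {
        int count = 0;
        int h = 0;
        for (Posting p : postings) {
            // Advance the hit cursor until it reaches or passes this posting's doc ID.
            while (h < sortedHits.length && sortedHits[h] < p.docId) {
                h++;
            }
            if (h < sortedHits.length && sortedHits[h] == p.docId) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] hits = {4, 5, 9};
        Posting[] withFreqs = {new Posting(1, 3), new Posting(4, 1), new Posting(7, 2), new Posting(9, 5)};
        Posting[] docsOnly = {new Posting(1, 1), new Posting(4, 1), new Posting(7, 1), new Posting(9, 1)};
        // Same doc IDs, different frequencies: the counts agree.
        System.out.println(countMatches(withFreqs, hits) == countMatches(docsOnly, hits));
    }
}
```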
   
   
   ### Benchmarks
   
   I ran the `wikimediumall` benchmark with a modified localrun.py script that 
used a new index for the candidate and it did not show any sizeable QPS 
changes. Size of the `facet` index also remained the same at 64 MB.
   
   
   ### Backwards compatibility
   
   I've not thought about this in depth (yet). Existing back-compat tests in 
branch_9x pass.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10586) Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, indexMetaIn, termsMetaIn

2022-05-24 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-10586.

Fix Version/s: 9.3
 Assignee: Tomoko Uchida
   Resolution: Fixed

Thank you both!

> Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, 
> indexMetaIn, termsMetaIn
> ---
>
> Key: LUCENE-10586
> URL: https://issues.apache.org/jira/browse/LUCENE-10586
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Trivial
> Fix For: 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Those three local variables refer to the same {{IndexInput}} object (no 
> clone() is called).
> {code}
> indexMetaIn = termsMetaIn = metaIn;
> {code}
> I'm not sure, but maybe there are historical reasons for this. I wonder if it 
> would be better to keep only one reference to the underlying {{IndexInput}} 
> object, to make the code a little easier to follow.
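
The aliasing described in this issue can be sketched in plain Java. `ToyInput` below is a hypothetical stand-in for a stateful input and is not Lucene's `IndexInput`; the sketch only shows that three names bound by a chained assignment share one object, and therefore one read position.

```java
/** Hypothetical stand-in for a stateful input; NOT Lucene's IndexInput. */
final class ToyInput {
    private final byte[] data;
    private int pos;

    ToyInput(byte[] data) {
        this.data = data;
    }

    byte readByte() {
        return data[pos++];
    }

    int position() {
        return pos;
    }
}

final class AliasDemo {
    /** Reads once through each alias and returns the position seen through the original name. */
    static int positionAfterReads() {
        ToyInput metaIn = new ToyInput(new byte[] {1, 2, 3});
        ToyInput indexMetaIn;
        ToyInput termsMetaIn;
        // Chained assignment: no copy is made, all three names refer to the same object.
        indexMetaIn = termsMetaIn = metaIn;

        indexMetaIn.readByte(); // advances the shared position
        termsMetaIn.readByte(); // advances it again
        return metaIn.position();
    }

    public static void main(String[] args) {
        System.out.println(positionAfterReads());
    }
}
```

Because a read through any alias moves the position for all of them, keeping a single name makes that sharing visible to the reader.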



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[GitHub] [lucene] mocobeta opened a new pull request, #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


mocobeta opened a new pull request, #923:
URL: https://github.com/apache/lucene/pull/923

   ### Description (or a Jira issue link if you have one)
   
   This is a minor update for `demo` module documentation.
   I had a chance to run the demo app and noticed that the commands in the 
tutorial use `CLASSPATH` (and the description is outdated); maybe it would be 
worth updating them to use the module path instead of the classpath?
   
   The latest tutorial:
   ![Screenshot from 2022-05-24 
19-28-13](https://user-images.githubusercontent.com/1825333/170011199-2d13f34c-ef36-46c5-b5fb-e297c7d0510e.png)
   
   Updated tutorial in this patch:
   ![Screenshot from 2022-05-24 
19-28-45](https://user-images.githubusercontent.com/1825333/170011328-cdde36d8-a17d-449f-ad8b-4e7909c8dde6.png)
   





[GitHub] [lucene] mocobeta commented on pull request #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1135776812

   Note: `modules-thirdparty` should be included on the module path, since the 
demo depends on `hppc` via `lucene-facet`. It also seems worthwhile to add 
`--add-modules jdk.unsupported` when running `SearchFiles`; otherwise it emits 
a (slightly intimidating) "WARNING: Unmapping is not supported, ..." message.





[GitHub] [lucene] msokolov commented on pull request #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


msokolov commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1135853409

   I found a few typos; if you have a moment maybe you could fix while you're 
updating?
   
   I'm also curious - is it still possible to use the old way (with CLASSPATH / 
-cp)? Also, is MODULEPATH an environment variable like CLASSPATH? If it is, do 
we also need the command-line arguments as shown?  I guess the previous 
documentation didn't really explain how to set the CLASSPATH, but given that 
MODULEPATH is a new thing and many users may be unfamiliar with it, maybe we 
should take the chance to educate a bit how to use it here. 
  





[GitHub] [lucene] mocobeta commented on pull request #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1135877621

   >  Also, is MODULEPATH an environment variable like CLASSPATH? 
   
   Thanks, that's a very good point. To the best of my knowledge, there is no 
environment variable such as `MODULEPATH` that implicitly sets the module path. 
The fully capitalized term is misleading, so I changed it to the plain "module 
path".
   
   > I'm also curious - is it still possible to use the old way (with CLASSPATH 
/ -cp)?
   
   Yes, the classpath will remain, and I think it is unlikely to be dropped.
   
   > I found a few typos; if you have a moment maybe you could fix while you're 
updating?
   
   Sure, could you tell me the lines we should fix?
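
To make the classpath versus module path distinction discussed above a bit more concrete, here is a small sketch using `Class.getModule()` from the JDK. `ModuleCheck` is a hypothetical helper and not part of the demo; a class loaded from the class path lands in an unnamed module, while one resolved from the module path belongs to a named module.

```java
final class ModuleCheck {
    /** Describes whether a class lives in a named module or in the unnamed (class path) module. */
    static String describe(Class<?> c) {
        Module m = c.getModule();
        return m.isNamed()
            ? "named module " + m.getName()
            : "unnamed module (loaded from the class path)";
    }

    public static void main(String[] args) {
        // JDK classes always live in named modules.
        System.out.println(describe(String.class));
        // The answer for this class depends on how the program was launched.
        System.out.println(describe(ModuleCheck.class));
    }
}
```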





[GitHub] [lucene] ywelsch commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-24 Thread GitBox


ywelsch commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r880463557


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+  @Override
+  public CollectionStatistics collectionStatistics(String field) throws IOException {
+CollectionStatistics shardStatistics = super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+  }
+};
+searcher.setSimilarity(new BM25Similarity());

Review Comment:
   fixed in 
[88b7f2c](https://github.com/apache/lucene/pull/910/commits/88b7f2ca8e44e554878a0c10f8ee6bfeb19e57d7)






[GitHub] [lucene] ywelsch commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-24 Thread GitBox


ywelsch commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r880465199


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {

Review Comment:
   I used the term "distributed" because that is the use case mentioned in the 
Javadocs of the `collectionStatistics` method. Fine to rename it here 
([88b7f2c](https://github.com/apache/lucene/pull/910/commits/88b7f2ca8e44e554878a0c10f8ee6bfeb19e57d7)).






[GitHub] [lucene] ywelsch commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-24 Thread GitBox


ywelsch commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r880465450


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+  @Override
+  public CollectionStatistics collectionStatistics(String field) throws IOException {
+CollectionStatistics shardStatistics = super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+  }
+};
+searcher.setSimilarity(new BM25Similarity());
+CombinedFieldQuery query =
+new CombinedFieldQuery.Builder()
+.addField("f")
+.addField("g")
+.addTerm(new BytesRef(queryString))
+.build();
+// just check that search does not fail
+searcher.search(query, 10);

Review Comment:
   I gave that a try in 
[88b7f2c](https://github.com/apache/lucene/pull/910/commits/88b7f2ca8e44e554878a0c10f8ee6bfeb19e57d7).
 Let me know what you think






[GitHub] [lucene] mocobeta commented on pull request #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1135898983

   > I guess the previous documentation didn't really explain how to set the 
CLASSPATH, but given that MODULEPATH is a new thing and many users may be 
unfamiliar with it, maybe we should take the chance to educate a bit how to use 
it here.
   
   Instead of explaining what the Java module system is and how to use it, I 
would rather give concrete commands that work when copy-pasted for this 
tutorial.
   
   





[GitHub] [lucene] mocobeta commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1

2022-05-24 Thread GitBox


mocobeta commented on PR #920:
URL: https://github.com/apache/lucene/pull/920#issuecomment-1135958878

   @jtibshirani thanks for reviewing.
   
   > Stepping through what happens, it looks like we just hit a really unlucky 
query + data combination where it takes more than 150 steps to conclude the 
search.
   
   Yes, it looks like it hits a really unlucky combination of query and data: 
once you add a line such as `random().nextInt();` somewhere in the test code to 
advance the random state, everything turns green. At first I was confused about 
what was happening there.
   
   > Another option is to decrease k to make the search more restrictive 
(currently it's set to 5, I think 1 would work instead).
   
   A smaller `k` (<= 4) does fix the problem. I can't tell which approach is 
better in this context. Let me know if we should tweak `k` instead of the range 
query's upper bound, and I'll update this!
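
The effect of an extra `nextInt()` call can be sketched with `java.util.Random`, used here as a stand-in for the test framework's `random()` (an assumption; `RandomStateDemo` is a hypothetical name). With a fixed seed, one extra call shifts every later draw by one position, so the test ends up with different data and a different query.

```java
import java.util.Random;

final class RandomStateDemo {
    /** Seeds a generator, discards `skip` values, then returns the next `n` draws. */
    static int[] draw(long seed, int skip, int n) {
        Random r = new Random(seed);
        for (int i = 0; i < skip; i++) {
            r.nextInt(); // an extra call advances the generator state by one step
        }
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = r.nextInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] original = draw(42L, 0, 3);
        int[] shifted = draw(42L, 1, 2);
        // The shifted run sees the original sequence starting at its second element.
        System.out.println(original[1] == shifted[0] && original[2] == shifted[1]);
    }
}
```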





[GitHub] [lucene] msokolov commented on pull request #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


msokolov commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1136012481

   
   >>  I found a few typos; if you have a moment maybe you could fix while 
you're updating?
   
   > Sure, could you tell me the lines we should fix?
   
   Hmm I added comments above, but github didn't post them yet - coming soon...





[GitHub] [lucene] msokolov commented on a diff in pull request #923: Replace classpath with modulepath in the demo tutorial

2022-05-24 Thread GitBox


msokolov commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r880432441


##
lucene/demo/src/java/overview.html:
##
@@ -49,36 +49,35 @@ About the Demo
 demonstrates various functionalities of Lucene and how you can add Lucene to
 your applications.
 
-
-Setting your CLASSPATH
+
+Setting your MODULEPATH
 
 First, you should http://www.apache.org/dyn/closer.cgi/lucene/java/";>download the latest
 Lucene distribution and then extract it to a working directory.
-You need four JARs: the Lucene JAR, the queryparser JAR, the common 
analysis JAR, and the Lucene
-demo JAR. You should see the Lucene JAR file in the modules/ directory you 
created
+You need Lucene demo and a few dependent modules.
+You should see the Lucene module (JAR) files in the modules/ and 
modules-thirdparty/ directory you created
 when you extracted the archive -- it should be named something like
 lucene-core-{version}.jar. You should also see
 files called lucene-queryparser-{version}.jar,
 lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, 
analysis/common/ and demo/,
-respectively.
-Put all four of these files in your Java CLASSPATH.
+"codefrag">lucene-demo-{version}.jar under modules directory.
+Put all of these files in your Java MODULEPATH.
 
 
 Indexing Files
 
 Once you've gotten this far you're probably itching to go. Let's build an
-index! Assuming you've set your CLASSPATH correctly, just type:
+index! Just type:
 
-java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+java --module-path modules:modules-thirdparty --module 
org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
 
 This will produce a subdirectory called index
 which will contain an index of all of the Lucene source code.
 To search the index type:
 
-java org.apache.lucene.demo.SearchFiles
+java --module-path modules:modules-thirdparty --add-modules 
jdk.unsupported --module 
org.apache.lucene.demo/org.apache.lucene.demo.SearchFiles
 
 You'll be prompted for a query. Type in a gibberish or made up word (for 
example: 
 "supercalifragilisticexpialidocious").

Review Comment:
   "visibile" -> "visible"
   "ile" -> "file"



##
lucene/demo/src/java/overview.html:
##
@@ -49,36 +49,35 @@ About the Demo
 demonstrates various functionalities of Lucene and how you can add Lucene to
 your applications.
 
-
-Setting your CLASSPATH
+
+Setting your MODULEPATH
 
 First, you should http://www.apache.org/dyn/closer.cgi/lucene/java/";>download the latest
 Lucene distribution and then extract it to a working directory.
-You need four JARs: the Lucene JAR, the queryparser JAR, the common 
analysis JAR, and the Lucene
-demo JAR. You should see the Lucene JAR file in the modules/ directory you 
created
+You need Lucene demo and a few dependent modules.
+You should see the Lucene module (JAR) files in the modules/ and 
modules-thirdparty/ directory you created
 when you extracted the archive -- it should be named something like
 lucene-core-{version}.jar. You should also see
 files called lucene-queryparser-{version}.jar,
 lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, 
analysis/common/ and demo/,
-respectively.
-Put all four of these files in your Java CLASSPATH.
+"codefrag">lucene-demo-{version}.jar under modules directory.
+Put all of these files in your Java MODULEPATH.
 
 
 Indexing Files
 
 Once you've gotten this far you're probably itching to go. Let's build an
-index! Assuming you've set your CLASSPATH correctly, just type:
+index! Just type:
 
-java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+java --module-path modules:modules-thirdparty --module 
org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
 
 This will produce a subdirectory called index
 which will contain an index of all of the Lucene source code.
 To search the index type:
 
-java org.apache.lucene.demo.SearchFiles
+java --module-path modules:modules-thirdparty --add-modules 
jdk.unsupported --module 
org.apache.lucene.demo/org.apache.lucene.demo.SearchFiles
 
 You'll be prompted for a query. Type in a gibberish or made up word (for 
example: 
 "supercalifragilisticexpialidocious").

Review Comment:
   "maching" -> "matching"






[jira] [Resolved] (LUCENE-10385) Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery.

2022-05-24 Thread Alan Woodward (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Woodward resolved LUCENE-10385.

Resolution: Fixed

> Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery.
> 
>
> Key: LUCENE-10385
> URL: https://issues.apache.org/jira/browse/LUCENE-10385
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This query can count matches by computing the first and last matching doc IDs 
> using binary search.
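
The counting idea in this issue can be sketched in plain Java. This is a simplified model and not the Lucene implementation: it assumes the index is sorted ascending on the field, each doc has exactly one value, and there are no deletions, so the values can be viewed as a sorted array indexed by doc ID. `SortedRangeCount` is a hypothetical name.

```java
final class SortedRangeCount {
    /** First index i with values[i] >= target; values must be sorted ascending. */
    static int lowerBound(long[] values, long target) {
        int lo = 0;
        int hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** First index i with values[i] > target; values must be sorted ascending. */
    static int upperBound(long[] values, long target) {
        int lo = 0;
        int hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] <= target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Counts docs whose value lies in [min, max] without visiting each match. */
    static int count(long[] valuesInDocIdOrder, long min, long max) {
        if (min > max) {
            return 0;
        }
        int firstMatch = lowerBound(valuesInDocIdOrder, min);
        int afterLastMatch = upperBound(valuesInDocIdOrder, max);
        return afterLastMatch - firstMatch;
    }

    public static void main(String[] args) {
        long[] values = {1, 3, 3, 5, 8, 13};
        System.out.println(count(values, 3, 8));
    }
}
```

The two binary searches cost O(log n) regardless of how many docs match, which is the appeal over iterating the matches.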






[jira] [Resolved] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2022-05-24 Thread Alan Woodward (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Woodward resolved LUCENE-10229.

Resolution: Fixed

> Match offsets should be consistent for fields with positions and fields with 
> offsets
> 
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields 
> with
> offsets don't highlight some more complex interval queries properly.  Alan 
> says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of 
> the outer.  And so if you're re-analyzing and retrieving offsets by looking 
> at the positions, you get the 'right' thing.  It's not obvious to me what the 
> correct response is here, but thinking about it the current behaviour is kind 
> of the worst of both worlds, and perhaps we should change it so that you get 
> offsets of the inner match as standard, and then the outer match is returned 
> as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which 
> restrict some other source of intervals, here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend 
> where they're highlighted is that filters are restrictions that should not be 
> highlighted - it's the source intervals that people care about. Filters are 
> what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight 
> examples (on fields with positions) where this intuition is demonstrated on 
> all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with 
> positions and fields with offsets.






[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-24 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541583#comment-17541583
 ] 

Michael Sokolov commented on LUCENE-10577:
--

Question:  should I post one commit adding a new Lucene93 codec and 
Lucene93HnswVectorsFormat etc, and another one actually implementing these 
changes to the format, or smoosh them together into one gigantic change? I'm 
leaning towards separating and creating a new format that is just a clone of 
the existing one, and then following up with the actual changes.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
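
The linear scaling plus quantization described in the issue could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the patch mentioned above: `ToyQuantizer` is a hypothetical name, the scale is derived from the largest absolute value in the data (the max(abs(min-value), abs(max-value)) choice from the description), and values are mapped to signed bytes in [-127, 127].

```java
final class ToyQuantizer {
    /** Scale factor that maps the largest magnitude in the data to 127. */
    static float scaleFor(float[][] vectors) {
        float maxAbs = 0f;
        for (float[] v : vectors) {
            for (float x : v) {
                maxAbs = Math.max(maxAbs, Math.abs(x));
            }
        }
        return maxAbs == 0f ? 1f : 127f / maxAbs;
    }

    /** Rounds each scaled component to the nearest integer and clamps it into a byte. */
    static byte[] quantize(float[] v, float scale) {
        byte[] out = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            int q = Math.round(v[i] * scale);
            out[i] = (byte) Math.max(-127, Math.min(127, q));
        }
        return out;
    }

    /** Dot product computed on the byte values, then mapped back to the float scale. */
    static float dotProduct(byte[] a, byte[] b, float scale) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum / (scale * scale);
    }

    public static void main(String[] args) {
        float[][] vectors = {{0.6f, 0.8f}, {1f, 0f}};
        float scale = scaleFor(vectors);
        byte[] a = quantize(vectors[0], scale);
        byte[] b = quantize(vectors[1], scale);
        // Exact dot product is 0.6; the quantized estimate should be close to it.
        System.out.println(dotProduct(a, b, scale));
    }
}
```

Each component shrinks from 4 bytes to 1, which is where the 4:1 ratio comes from. A single outlier would inflate the scale and waste precision for every other value, as the description notes.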






[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541620#comment-17541620
 ] 

Adrien Grand commented on LUCENE-10590:
---

Does the indexing logic rely on tie breaking by node ID? If not, maybe 
index-time graph search could stop as soon as the k-th nearest vector is equal 
to the input vector?

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael Sokolov
>Priority: Major
>
> By accident, while testing something else, I ran a luceneutil test indexing 1M 
> 100d vectors in which every vector was all zeroes. This caused indexing to 
> take a very long time (~40x normal; it did eventually complete), and the 
> search performance was similarly bad. We should not degrade by orders of 
> magnitude, even with the worst data.
> I'm not entirely sure what the issue is, but perhaps as long as we keep 
> finding hits that are "better" we keep exploring the graph, where better 
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs 
> have the same score, then we probably need to either switch to > (but this 
> could lead to poorer recall in normal cases) or introduce some kind of 
> minimum score threshold?
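
The hypothesis in the description can be illustrated with a toy model. The sketch below is not Lucene's HNSW code: it is a simplified best-first search over a hypothetical ring graph in which every node has the same score, and the names `TieSearchSketch`, `ring` and `expanded` are made up for illustration. Under these assumptions, accepting ties (>=) expands every node, while the strict comparison (>) stops after k nodes.

```java
import java.util.PriorityQueue;

class TieSearchSketch {
  /** Builds a ring graph: node i is linked to its two neighbours. */
  static int[][] ring(int n) {
    int[][] neighbors = new int[n][];
    for (int i = 0; i < n; i++) {
      neighbors[i] = new int[] {(i + 1) % n, (i + n - 1) % n};
    }
    return neighbors;
  }

  /** Returns how many nodes a toy best-first search expands when starting at node 0. */
  static int expanded(int[][] neighbors, float[] scores, int k, boolean acceptTies) {
    boolean[] visited = new boolean[scores.length];
    // Candidates are polled best-first; results keep the current top k, worst on top.
    PriorityQueue<Integer> candidates =
        new PriorityQueue<>((a, b) -> Float.compare(scores[b], scores[a]));
    PriorityQueue<Integer> results =
        new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
    candidates.add(0);
    results.add(0);
    visited[0] = true;
    int count = 0;
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      // Stop once the best remaining candidate is worse than the worst kept result.
      if (results.size() >= k && scores[node] < scores[results.peek()]) {
        break;
      }
      count++;
      for (int nb : neighbors[node]) {
        if (visited[nb]) {
          continue;
        }
        visited[nb] = true;
        float worst = scores[results.peek()];
        boolean better = acceptTies ? scores[nb] >= worst : scores[nb] > worst;
        if (results.size() < k || better) {
          candidates.add(nb);
          results.add(nb);
          if (results.size() > k) {
            results.poll();
          }
        }
      }
    }
    return count;
  }
}
```

With 100 nodes that all score 0 and k = 2, this toy search expands all 100 nodes when ties are accepted and only 2 when they are not. Whether the real indexing code degrades for the same reason is still the open question raised in the issue.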






[GitHub] [lucene] msokolov opened a new pull request, #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs

2022-05-24 Thread GitBox


msokolov opened a new pull request, #924:
URL: https://github.com/apache/lucene/pull/924

   I want to do this in order to enable changes in the HnswVectorsFormat





[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-24 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541621#comment-17541621
 ] 

Michael Sokolov commented on LUCENE-10577:
--

https://github.com/apache/lucene/pull/924/files is for creating Lucene93 Codec 
with no change. One question I have about that: how do we create the indexes 
that we check into backward-codecs tests, and do I need to do that?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
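
The "simple linear scaling + quantization" idea from the description can be sketched roughly as follows. This is not the experimental patch mentioned above: the class and method names are hypothetical, the scale is taken as max(abs(min-value), abs(max-value)) as suggested in the description, and mapping onto the range [-127, 127] is an assumption made for illustration. Unlike the setup described in drawback 2, the dot product here is computed directly on the byte values.

```java
class ScalarQuantizationSketch {
  /** Largest absolute component, i.e. max(abs(min-value), abs(max-value)). */
  static float maxAbs(float[] vector) {
    float max = 0f;
    for (float x : vector) {
      max = Math.max(max, Math.abs(x));
    }
    return max;
  }

  /** Maps each 4-byte float onto one signed byte in [-127, 127] (4:1 compression). */
  static byte[] quantize(float[] vector, float scale) {
    byte[] out = new byte[vector.length];
    if (scale <= 0f) {
      return out; // all-zero input: nothing to scale
    }
    for (int i = 0; i < vector.length; i++) {
      // Clamp so that outliers beyond the chosen scale cannot overflow the byte range.
      float clamped = Math.max(-scale, Math.min(scale, vector[i]));
      out[i] = (byte) Math.round(clamped / scale * 127f);
    }
    return out;
  }

  /** Dot product on byte values, rescaled to approximate the float result. */
  static float dotProduct(byte[] a, byte[] b, float scale) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    float step = scale / 127f;
    return sum * step * step;
  }
}
```

For the unit vector {0.6, -0.8}, the byte-based dot product of the vector with itself comes out close to the exact value of 1.0. The remaining gap is the quantization error that the choice of scale is meant to minimize.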






[GitHub] [lucene] jpountz commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-05-24 Thread GitBox


jpountz commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r880758668


##
lucene/core/src/java/org/apache/lucene/index/MappedMultiFields.java:
##
@@ -43,8 +43,8 @@ public MappedMultiFields(MergeState mergeState, MultiFields 
multiFields) {
   @Override
   public Terms terms(String field) throws IOException {
 MultiTerms terms = (MultiTerms) in.terms(field);
-if (terms == null) {
-  return null;
+if (terms == null || terms == Terms.EMPTY) {

Review Comment:
   can we leave the `if` statement as is?



##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -1378,7 +1378,7 @@ private static Status.TermIndexStatus checkFields(
   computedFieldCount++;
 
   final Terms terms = fields.terms(field);
-  if (terms == null) {
+  if (terms == Terms.EMPTY) {

Review Comment:
   We should remove this `if` statement. There is a check a few lines above 
that indexing is enabled on the field, so terms must not be null.



##
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java:
##
@@ -595,7 +595,7 @@ private void setField(String field) throws IOException {
 
 DocIdSetIterator nextTerm(String field, BytesRef term) throws IOException {
   setField(field);
-  if (termsEnum != null) {
+  if (termsEnum != null && termsEnum != TermsEnum.EMPTY) {

Review Comment:
   would it work to leave the `if` statement as is?



##
lucene/test-framework/src/java/org/apache/lucene/tests/codecs/asserting/AssertingPostingsFormat.java:
##
@@ -79,7 +79,10 @@ public Iterator<String> iterator() {
 @Override
 public Terms terms(String field) throws IOException {
   Terms terms = in.terms(field);
-  return terms == null ? null : new 
AssertingLeafReader.AssertingTerms(terms);
+  if (terms == Terms.EMPTY) {

Review Comment:
   let's remove this check, and assert that `terms` is not null (`terms(String 
field)` may only get called on codec APIs if the field is indexed)



##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.java:
##
@@ -200,8 +200,8 @@ public Terms terms(String field) throws IOException {
 return delegateFieldsProducer.terms(field);
   } else {
 Terms result = delegateFieldsProducer.terms(field);
-if (result == null) {
-  return null;
+if (result == null || result == Terms.EMPTY) {

Review Comment:
   since there are no ghost fields anymore, we should be able to remove the 
`if` statement entirely, does it cause test failures?






[GitHub] [lucene] jpountz commented on a diff in pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-05-24 Thread GitBox


jpountz commented on code in PR #897:
URL: https://github.com/apache/lucene/pull/897#discussion_r880768892


##
lucene/core/src/java/org/apache/lucene/document/NearestNeighbor.java:
##
@@ -220,7 +216,7 @@ public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
   }
 
   /** Holds one hit from {@link NearestNeighbor#nearest} */
-  static class NearestHit {
+  public static class NearestHit {

Review Comment:
   can it remain pkg-private too?



##
lucene/core/src/java/org/apache/lucene/document/LatLonPoint.java:
##
@@ -362,4 +375,71 @@ public static Query newDistanceFeatureQuery(
 }
 return query;
   }
+
+  /**
+   * Finds the {@code n} nearest indexed points to the provided point, 
according to Haversine
+   * distance.
+   *
+   * <p>This is functionally equivalent to running {@link MatchAllDocsQuery} 
with a {@link
+   * LatLonDocValuesField#newDistanceSort}, but is far more efficient since it 
takes advantage of
+   * properties the indexed BKD tree. Multi-valued fields are currently not 
de-duplicated, so if a
+   * document had multiple instances of the specified field that make it into 
the top n, that
+   * document will appear more than once.
+   *
+   * <p>Documents are ordered by ascending distance from the location. The 
value returned in {@link
+   * FieldDoc} for the hits contains a Double instance with the distance in 
meters.
+   *
+   * @param searcher IndexSearcher to find nearest points from.
+   * @param field field name. must not be null.
+   * @param latitude latitude at the center: must be within standard +/-90 
coordinate bounds.
+   * @param longitude longitude at the center: must be within standard +/-180 
coordinate bounds.
+   * @param n the number of nearest neighbors to retrieve.
+   * @return TopFieldDocs containing documents ordered by distance, where the 
field value for each
+   * {@link FieldDoc} is the distance in meters
+   * @throws IllegalArgumentException if the underlying PointValues is not a 
{@code
+   * Lucene60PointsReader} (this is a current limitation), or if {@code 
field} or {@code

Review Comment:
   we removed this limitation that `Lucene60PointsReader` is required
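
For readers unfamiliar with the Javadoc above, the behaviour it describes (hits ordered by ascending Haversine distance, with the distance reported in meters) can be sketched as a brute-force scan. This is only an illustration of the semantics, not the BKD-based implementation in the PR: the class name, the `nearest` helper and the earth radius constant are assumptions introduced here.

```java
import java.util.Arrays;
import java.util.Comparator;

class HaversineSketch {
  // Assumed mean earth radius in meters; the value Lucene uses may differ.
  static final double EARTH_MEAN_RADIUS_METERS = 6_371_008.7714;

  /** Great-circle distance in meters between two points given in degrees. */
  static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double dPhi = Math.toRadians(lat2 - lat1);
    double dLambda = Math.toRadians(lon2 - lon1);
    double a =
        Math.pow(Math.sin(dPhi / 2), 2)
            + Math.cos(phi1) * Math.cos(phi2) * Math.pow(Math.sin(dLambda / 2), 2);
    // Guard against rounding pushing the value slightly above 1.
    return 2 * EARTH_MEAN_RADIUS_METERS * Math.asin(Math.sqrt(Math.min(1.0, a)));
  }

  /** Indices of the n points closest to (lat, lon), ordered by ascending distance. */
  static int[] nearest(double[][] points, double lat, double lon, int n) {
    Integer[] order = new Integer[points.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(
        order,
        Comparator.comparingDouble(
            (Integer i) -> haversineMeters(lat, lon, points[i][0], points[i][1])));
    int[] top = new int[Math.min(n, order.length)];
    for (int i = 0; i < top.length; i++) {
      top[i] = order[i];
    }
    return top;
  }
}
```

A full scan like this visits every point, which is the cost the Javadoc says the BKD-based search avoids.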






[GitHub] [lucene] gsmiller commented on pull request #922: Index only the docs for FacetField posting list

2022-05-24 Thread GitBox


gsmiller commented on PR #922:
URL: https://github.com/apache/lucene/pull/922#issuecomment-1136295588

   I'm not actually sure these options are referenced/honored anywhere during 
indexing, which might explain why you don't see a difference. Maybe you've dug 
into this deeper and know better, but I think all the term indexing for 
drill-down happens in `FacetsConfig#indexDrillDownTerms`, where it creates new 
`StringField` instances for the doc for the actual indexing. Like I said 
though, maybe you've dug into this deeper?





[GitHub] [lucene-solr] janhoy opened a new pull request, #2661: SOLR-16213 Upgrade Jackson to version 2.13.3

2022-05-24 Thread GitBox


janhoy opened a new pull request, #2661:
URL: https://github.com/apache/lucene-solr/pull/2661

   https://issues.apache.org/jira/browse/SOLR-16213





[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs

2022-05-24 Thread GitBox


msokolov commented on PR #924:
URL: https://github.com/apache/lucene/pull/924#issuecomment-1136470133

   In case it wasn't clear this is literally just bumping the version numbers 
and doing the requisite copy/paste to get all the symbols to resolve properly, 
and tests to pass





[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-24 Thread GitBox


jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r881050412


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,97 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testOverrideCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+Similarity similarity = randomCompatibleSimilarity();
+iwc.setSimilarity(similarity);
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+int numMatch = atLeast(10);
+for (int i = 0; i < numMatch; i++) {
+  Document doc = new Document();
+  if (random().nextBoolean()) {
+doc.add(new TextField("a", "baz", Store.NO));
+doc.add(new TextField("b", "baz", Store.NO));
+for (int k = 0; k < 2; k++) {
+  doc.add(new TextField("ab", "baz", Store.NO));
+}
+w.addDocument(doc);
+doc.clear();
+  }
+  int freqA = random().nextInt(5) + 1;
+  for (int j = 0; j < freqA; j++) {
+doc.add(new TextField("a", "foo", Store.NO));
+  }
+  int freqB = random().nextInt(5) + 1;
+  for (int j = 0; j < freqB; j++) {
+doc.add(new TextField("b", "foo", Store.NO));
+  }
+  int freqAB = freqA + freqB;
+  for (int j = 0; j < freqAB; j++) {
+doc.add(new TextField("ab", "foo", Store.NO));
+  }
+  w.addDocument(doc);
+}
+
+IndexReader reader = w.getReader();
+
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+
+int extraSumDocFreqA = extraDocCount + randomIntBetween(0, 10);

Review Comment:
   I think it'd make more sense to have a single `sumDocFreq` here. This 
represents the number of unique term-document pairs, and we can't just add the 
values across different fields. In fact `CombinedFieldQuery` chooses to take a 
maximum of the `sumDocFreq`.
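
   A small illustration of why the per-field values should not simply be added. The sketch below uses a hypothetical helper (not a Lucene API) that counts unique term-document pairs for one field, given each document's tokens for that field:

```java
import java.util.Arrays;
import java.util.HashSet;

class SumDocFreqSketch {
  /**
   * Counts unique (term, document) pairs for a single field. docsForField[d] holds the tokens
   * that document d has in that field.
   */
  static long sumDocFreq(String[][] docsForField) {
    long pairs = 0;
    for (String[] tokens : docsForField) {
      // A term counts once per document, however often it repeats there.
      pairs += new HashSet<>(Arrays.asList(tokens)).size();
    }
    return pairs;
  }
}
```

   Take two documents where field "a" holds {foo, foo} and {baz}, field "b" holds {foo} and {baz}, and the combined field "ab" holds the union of both. Each of the three fields then has a `sumDocFreq` of 2. Adding the values for "a" and "b" gives 4, which overcounts, while the maximum gives 2, which matches the combined field in this toy data.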






[GitHub] [lucene] jtibshirani commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs

2022-05-24 Thread GitBox


jtibshirani commented on PR #924:
URL: https://github.com/apache/lucene/pull/924#issuecomment-1136525898

   I think you may have forgotten to create unit tests for the old format (step 
2 here: 
https://github.com/apache/lucene/tree/main/lucene/backward-codecs#making-index-format-changes).
 Also, would it make sense to merge this to a feature branch, so you could 
bundle it with the quantization changes you're planning? We recently took this 
approach for some other changes to the vectors format like LUCENE-10502.





[GitHub] [lucene-solr] madrob commented on pull request #2661: SOLR-16213 Upgrade Jackson to version 2.13.3

2022-05-24 Thread GitBox


madrob commented on PR #2661:
URL: https://github.com/apache/lucene-solr/pull/2661#issuecomment-1136629649

   Can we make a changes entry? I think there was already one for a previous 
Jackson upgrade in this release (phone commenting so can't easily verify)





[GitHub] [lucene] LuXugang commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk

2022-05-24 Thread GitBox


LuXugang commented on PR #728:
URL: https://github.com/apache/lucene/pull/728#issuecomment-1136706160

   It seems the core problem is how to avoid loading all vector values of all 
fields into memory during indexing. IIUC, as @rmuir said, we could stream 
vectors to the codec API directly. A rough draft of the `.vec` layout might 
look like this:
   [image: https://user-images.githubusercontent.com/6985548/170176157-76bf2506-6c4b-480f-8191-919443077b15.png]
   
   This is similar to how `.fdx` writes stored values on the fly. After the 
`.vec` file is closed, we would read it back and build an HNSW graph.
   
   We could locate the part of one field's vector values within a `chunk` by 
node and doc, but this would surely be a bit slower than storing all of one 
field's vector values in one contiguous interval (where a vector value can be 
randomly accessed by ord(node) and dimension).
   
   > If a user had 100 vector fields, then now we might have 100+ files being 
written concurrently, multiplied by the number of segments we're writing at the 
same time. It seems like this could cause problems 
   
   @jtibshirani, or should we still write all values of all fields to a single 
temp file, as in the picture above, and then, when a flush is triggered, read 
this temp file and write it out in the Lucene92 codec format?
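
   To make the "random access by ord(node) and dimension" point concrete, here is a rough sketch of the contiguous, single-field case using plain `java.nio`. It is not the Lucene codec API: the class and method names are hypothetical, and it assumes one field with a fixed dimension, so a vector's byte offset is simply ord * dimension * 4.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class VectorSpillSketch {
  /** Appends one vector at the channel's current position as raw 4-byte floats. */
  static void append(FileChannel out, float[] vector) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    buf.asFloatBuffer().put(vector); // the float view leaves buf's own position at 0
    while (buf.hasRemaining()) {
      out.write(buf);
    }
  }

  /** Reads the vector with the given ordinal without touching any other vector. */
  static float[] read(FileChannel in, int ord, int dim) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    long offset = (long) ord * dim * Float.BYTES;
    while (buf.hasRemaining()) {
      if (in.read(buf, offset + buf.position()) < 0) {
        throw new EOFException("no vector with ord " + ord);
      }
    }
    buf.flip();
    float[] vector = new float[dim];
    buf.asFloatBuffer().get(vector);
    return vector;
  }

  /** Streams all vectors to a temp file, then reads one of them back by ordinal. */
  static float[] roundTrip(float[][] vectors, int ord) {
    try {
      Path tmp = Files.createTempFile("vectors", ".tmp");
      try (FileChannel ch =
          FileChannel.open(tmp, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        for (float[] v : vectors) {
          append(ch, v); // only one vector is held in memory at a time
        }
        return read(ch, ord, vectors[0].length);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

   A layout that interleaves several fields in chunks would need an extra lookup from (field, node) to a file offset, which is the slowdown mentioned above.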





[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-24 Thread GitBox


gsmiller commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1136707326

   Thanks @Yuti-G for the feedback and benchmark results! I appreciate you 
taking a look since I know you're quite familiar with this code. I saw a couple 
opportunities to de-dupe some code, but you did all the hard work introducing 
the feature. Thanks again!





[GitHub] [lucene] gsmiller commented on a diff in pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-24 Thread GitBox


gsmiller commented on code in PR #915:
URL: https://github.com/apache/lucene/pull/915#discussion_r881201779


##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java:
##
@@ -0,0 +1,333 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.sortedset;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.PrimitiveIterator;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.FacetsConfig.DimConfig;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.facet.TopOrdAndIntQueue;
+import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState.DimTree;
+import 
org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState.OrdRange;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+/** Base class for SSDV faceting implementations. */
+abstract class AbstractSortedSetDocValueFacetCounts extends Facets {
+
+  private static final Comparator<FacetResult> FACET_RESULT_COMPARATOR =
+  new Comparator<>() {
+@Override
+public int compare(FacetResult a, FacetResult b) {
+  if (a.value.intValue() > b.value.intValue()) {
+return -1;
+  } else if (b.value.intValue() > a.value.intValue()) {
+return 1;
+  } else {
+return a.dim.compareTo(b.dim);
+  }
+}
+  };
+
+  final SortedSetDocValuesReaderState state;
+  final FacetsConfig stateConfig;
+  final SortedSetDocValues dv;
+  final String field;
+
+  AbstractSortedSetDocValueFacetCounts(SortedSetDocValuesReaderState state) 
throws IOException {
+this.state = state;
+this.field = state.getField();
+this.stateConfig = state.getFacetsConfig();
+this.dv = state.getDocValues();
+  }
+
+  @Override
+  public FacetResult getTopChildren(int topN, String dim, String... path) 
throws IOException {
+validateTopN(topN);
+TopChildrenForPath topChildrenForPath = getTopChildrenForPath(topN, dim, 
path);
+return createFacetResult(topChildrenForPath, dim, path);
+  }
+
+  @Override
+  public Number getSpecificValue(String dim, String... path) throws 
IOException {
+if (path.length != 1) {
+  throw new IllegalArgumentException("path must be length=1");
+}
+int ord = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, 
path)));
+if (ord < 0) {
+  return -1;
+}
+
+return getCount(ord);
+  }
+
+  @Override
+  public List<FacetResult> getAllDims(int topN) throws IOException {
+validateTopN(topN);
+List<FacetResult> results = new ArrayList<>();
+for (String dim : state.getDims()) {
+  TopChildrenForPath topChildrenForPath = getTopChildrenForPath(topN, dim);
+  FacetResult facetResult = createFacetResult(topChildrenForPath, dim);
+  if (facetResult != null) {
+results.add(facetResult);
+  }
+}
+
+// Sort by highest count:
+results.sort(FACET_RESULT_COMPARATOR);
+return results;
+  }
+
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws 
IOException {
+validateTopN(topNDims);
+validateTopN(topNChildren);
+
+// Creates priority queue to store top dimensions and sort by their 
aggregated values/hits and
+// string values.
+PriorityQueue<DimValue> pq =
+new PriorityQueue<>(topNDims) {
+  @Override
+  protected boolean lessThan(DimValue a, DimValue b) {
+if (a.value > b.value) {
+  return false;
+} else if (a.value < b.value) {
+  return true;
+} else {
+  return a.dim.compareTo(b.dim) > 0;
+}
+  }
+};
+
+// Keep track of intermediate results, if we compute them, so we can reuse 
them later:
+Map<String, TopChildrenForPath> intermediateResults = null;
+
+for (String dim : state.getDims()) {
+  DimConfig dimConfig = stateConfig.getDimConfig(dim);
+  int