[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-05 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314734#comment-17314734
 ] 

Robert Muir commented on LUCENE-9855:
-

the "strategy" is a huge antipattern. let's split into separate codecs so that 
?Format has a real api. right now it's too difficult to improve the 
implementation (and there are massive memory inefficiencies) or even provide 
different options. too much stuff tangled into one format. it's so bad that we 
can't even name it.

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-05 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314794#comment-17314794
 ] 

Tomoko Uchida commented on LUCENE-9855:
---

{quote}right now it's too difficult to improve the implementation (and there are 
massive memory inefficiencies) or even provide different options. too much 
stuff tangled into one format. it's so bad that we can't even name it.
{quote}
I didn't intend to discuss implementations here, only to deal with the naming 
issue, though I understand the current state of the vector search (or hnsw) is 
not perfect.

To me it sounds like we can neither name it nor ship it with 9.0. If so, I'm 
afraid the discussion seems to go beyond this issue (and me); should we return 
to LUCENE-9004 or LUCENE-9322 to address the fundamental question?







[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features

2021-04-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314853#comment-17314853
 ] 

Michael McCandless commented on LUCENE-9902:


{quote}I wonder if we need a specific 8.9 CHANGES.txt entry for this change so 
that it gets picked up in the 8.9 release?
{quote}
What we typically do is, in {{main}} branch, put a {{CHANGES.txt}} entry under 
the {{8.9.0}} release section (not in the {{9.0}} section!).  And then 
backport.  This way there is only one entry for each change, appearing under 
the earliest release that first got that feature/fix.

> Update faceting API to use modern Java features
> ---
>
> Key: LUCENE-9902
> URL: https://issues.apache.org/jira/browse/LUCENE-9902
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was using the {{public int getOrdinal(String dim, String[] path)}} API for 
> a single {{path}} String and found myself creating an array with a single 
> element. We can start using variable length args for this method.
> I also propose this change:
>  I wanted to know the specific count of an ordinal using the 
> {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It 
> would be good if we could change it to {{protected}} so that users can know 
> the value of an ordinal without looking up the {{FacetLabel}} and then 
> checking its value.
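
The varargs proposal in the description above can be sketched as follows. This 
is only an illustration: OrdinalLookup and its map-backed storage are 
hypothetical stand-ins, not the taxonomy reader API; only the shape of the 
getOrdinal(String dim, String... path) signature reflects the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a taxonomy lookup; not Lucene's implementation.
final class OrdinalLookup {
  private final Map<String, Integer> ordinals = new HashMap<>();

  // Registers a label (dim plus path components) and returns its ordinal.
  int add(String dim, String... path) {
    String k = key(dim, path);
    Integer existing = ordinals.get(k);
    if (existing != null) {
      return existing;
    }
    int ord = ordinals.size();
    ordinals.put(k, ord);
    return ord;
  }

  // Varargs signature: callers may pass a String[] or individual components.
  int getOrdinal(String dim, String... path) {
    Integer ord = ordinals.get(key(dim, path));
    return ord == null ? -1 : ord; // -1 signals "not found" in this sketch
  }

  private static String key(String dim, String[] path) {
    return dim + "/" + String.join("/", path);
  }
}
```

With this signature, getOrdinal("Author", "Lisa") and 
getOrdinal("Author", new String[] {"Lisa"}) are the same call, so existing 
array-based callers keep compiling.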






[jira] [Commented] (LUCENE-9903) Search is not working while migrating Lucene 3.6.2 to 8.7.0

2021-04-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314858#comment-17314858
 ] 

Michael McCandless commented on LUCENE-9903:


Lots of things changed between Lucene 3.6.2 and 8.7.0!

But this is not yet issue-worthy, until we can isolate to a specific problem.

So, could you instead send an email to the java users list 
({{java-u...@lucene.apache.org}}) and include more details?  Maybe make a tiny 
example program (or real unit test) showing the issue?

> Search is not working while migrating Lucene 3.6.2 to 8.7.0
> ---
>
> Key: LUCENE-9903
> URL: https://issues.apache.org/jira/browse/LUCENE-9903
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.7
>Reporter: lakshman
>Priority: Major
>
> We are upgrading Lucene from 3.6.2 to 8.7.0 and are facing a search issue: 
> once a request reaches the search method, no response comes back and the 
> execution flow is blocked in the search method:
> public void collect(IndexSearcher is) throws CorruptIndexException, IOException {
>   TopFieldCollector collector = TopFieldCollector.create(reverse ? reverseSort : sort, numr, val);
>   is.setSimilarity(new ClassicSimilarity());
>   is.search(query, numr);
> }
>  






[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314865#comment-17314865
 ] 

Michael McCandless commented on LUCENE-9850:


Impressive results!  And I love the flame charts that now come builtin with 
JMC/JFR in the JDK!

I would expect {{XTerm}} to show speedups since this is largely dominated by 
decoding many postings blocks.  But it is odd to see the {{XTermYSort}} tasks 
negatively impacted: those tasks are just sorting by a {{DocValues}} field 
instead of default text relevance (BM25).

Net/net, given the impressive improvement in compression and the speedup in 
raw decode of postings blocks, I think this is worth doing?

These results are {{wikimediumall}} right?  If you re-run but with only the 
tasks that regressed above, do they still show regression?  I'm wondering if 
hotspot compilation noise is contributing ...

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: apply_exceptions.png, bulk_read_1.png, bulk_read_2.png, 
> for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
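
To make the size trade-off in the description concrete, here is a rough sketch 
of how the per-value bit width differs between FOR and PFOR for one block. It 
is a simplified model, not Lucene's encoder: the class and method names are 
hypothetical, values are assumed non-negative, and the exception budget is a 
plain parameter.

```java
import java.util.Arrays;

// Simplified model of choosing a bit width per block; not Lucene's encoder.
final class PforSketch {
  // Number of bits needed to represent a non-negative value.
  static int bitsRequired(int value) {
    return value == 0 ? 0 : 32 - Integer.numberOfLeadingZeros(value);
  }

  // FOR: every value in the block is stored with the width of the largest one.
  static int forBitsPerValue(int[] block) {
    int bits = 0;
    for (int v : block) {
      bits = Math.max(bits, bitsRequired(v));
    }
    return bits;
  }

  // PFOR: pick the width that covers all but (at most) maxExceptions values;
  // the remaining outliers would be patched in separately as exceptions.
  static int pforBitsPerValue(int[] block, int maxExceptions) {
    if (block.length == 0) {
      return 0;
    }
    int[] sorted = block.clone();
    Arrays.sort(sorted);
    int idx = Math.max(0, sorted.length - 1 - maxExceptions);
    return bitsRequired(sorted[idx]);
  }

  // Counts the values that do not fit in the chosen width.
  static int exceptionCount(int[] block, int bitsPerValue) {
    int count = 0;
    for (int v : block) {
      if (bitsRequired(v) > bitsPerValue) {
        count++;
      }
    }
    return count;
  }
}
```

For a block of 128 deltas where 127 are 3 and one is 1000, FOR needs 10 bits 
per value, while PFOR with a budget of 7 exceptions needs 2 bits per value plus 
one exception. The extra decode step for that exception is the cost being 
weighed against the smaller index.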






[GitHub] [lucene] mikemccand commented on pull request #56: LUCENE-9883: Turn on ecj missingEnumCaseDespiteDefault setting

2021-04-05 Thread GitBox


mikemccand commented on pull request #56:
URL: https://github.com/apache/lucene/pull/56#issuecomment-813397017


   Thank you for looking into backporting!  I think it's fine to leave this as 
9.x / Lucene only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Updated] (LUCENE-9573) Add back compat tests for VectorFormat to TestBackwardsCompatibility

2021-04-05 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-9573:
---
Fix Version/s: (was: main (9.0))
   9.1

> Add back compat tests for VectorFormat to TestBackwardsCompatibility
> 
>
> Key: LUCENE-9573
> URL: https://issues.apache.org/jira/browse/LUCENE-9573
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Blocker
> Fix For: 9.1
>
>
> In LUCENE-9322 we add a new VectorFormat to the index. This issue is about 
> adding backwards compatibility tests for it once the index format has 
> crystallized into its 9.0 form






[jira] [Commented] (LUCENE-9573) Add back compat tests for VectorFormat to TestBackwardsCompatibility

2021-04-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314874#comment-17314874
 ] 

Michael McCandless commented on LUCENE-9573:


Sorry, yes, you are right [~jpountz]!  This is 9.1 blocker.  I'll change the 
fix version!  Sorry for the confusion :)

> Add back compat tests for VectorFormat to TestBackwardsCompatibility
> 
>
> Key: LUCENE-9573
> URL: https://issues.apache.org/jira/browse/LUCENE-9573
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Blocker
> Fix For: main (9.0)
>
>
> In LUCENE-9322 we add a new VectorFormat to the index. This issue is about 
> adding backwards compatibility tests for it once the index format has 
> crystallized into its 9.0 form






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-05 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314876#comment-17314876
 ] 

Tomoko Uchida commented on LUCENE-9855:
---

I understand the vector search is a kind of incubation project; I remember at 
the very early stage I hoped we could start it from the sandbox project but it 
was not possible with the apis. Here I've tried to find some level of consensus 
among us, but my attempt does not seem to be going well. :) (Nevertheless, I 
think I am inclined to support Julie's perspective or design on this.)

I can't say "it's a blocker so just fix it", nor simply throw this away; I'd 
like to wait a little while to hear from others.







[jira] [Commented] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery

2021-04-05 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314879#comment-17314879
 ] 

Feng Guo commented on LUCENE-9857:
--

Thank you for the reply! I'm so sorry that I did not notice the code you 
pointed out. I'm digging into a case where queries become slower after a 
forcemerge, but obviously this issue is not the root cause. Thank you again for 
the explanation!

> Skip cache building if IndexOrDocValuesQuery choose the dvQuery
> ---
>
> Key: LUCENE-9857
> URL: https://issues.apache.org/jira/browse/LUCENE-9857
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery can automatically use dvQueries when the cost > 8 * 
> leadcost, and the LRUQueryCache skips cache building when cost > 250 (by 
> default) * leadcost. There is a gap between 8 and 250, which means if the 
> factor is between 8 and 250 (e.g. cost = 10 * leadcost), the 
> IndexOrDocValuesQuery will choose the dvQueries but LRUQueryCache still builds 
> a cache for it.
> IndexOrDocValuesQuery aims to speed up queries when the leadcost is small, 
> but building cache by dvScorers can make it meaningless because it needs to 
> scan all the docvalues. This can be rather slow for big segments, so maybe we 
> should skip the cache building for IndexOrDocValuesQuery when it chooses 
> dvQueries.
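
The gap described above can be written down as a small model. This is only a 
sketch of the two inequalities as the description states them; the class, the 
method names, and the constants' names are hypothetical, and the real checks 
live inside IndexOrDocValuesQuery and LRUQueryCache.

```java
// Simplified model of the two thresholds quoted in the issue description.
final class CostGap {
  static final long DV_FACTOR = 8;           // dvQuery chosen above 8 * leadCost
  static final long SKIP_CACHE_FACTOR = 250; // default skip factor per the description

  // True when the doc-values query would be chosen over the index query.
  static boolean choosesDocValues(long cost, long leadCost) {
    return cost > DV_FACTOR * leadCost;
  }

  // True when the cache would skip building an entry for this query.
  static boolean skipsCacheBuilding(long cost, long leadCost) {
    return cost > SKIP_CACHE_FACTOR * leadCost;
  }

  // The problematic range: doc values are chosen, yet a cache entry is still built.
  static boolean inGap(long cost, long leadCost) {
    return choosesDocValues(cost, leadCost) && !skipsCacheBuilding(cost, leadCost);
  }
}
```

For leadCost = 1000, a cost of 10000 (10 * leadCost) falls in the gap, while 
5000 and 300000 do not.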






[GitHub] [lucene] gf2121 closed pull request #29: LUCENE-9857: Skip cache building if IndexOrDocValuesQuery choose the dvQuery

2021-04-05 Thread GitBox


gf2121 closed pull request #29:
URL: https://github.com/apache/lucene/pull/29


   





[jira] [Resolved] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery

2021-04-05 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo resolved LUCENE-9857.
--
Resolution: Fixed

> Skip cache building if IndexOrDocValuesQuery choose the dvQuery
> ---
>
> Key: LUCENE-9857
> URL: https://issues.apache.org/jira/browse/LUCENE-9857
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery can automatically use dvQueries when the cost > 8 * 
> leadcost, and the LRUQueryCache skips cache building when cost > 250 (by 
> default) * leadcost. There is a gap between 8 and 250, which means if the 
> factor is between 8 and 250 (e.g. cost = 10 * leadcost), the 
> IndexOrDocValuesQuery will choose the dvQueries but LRUQueryCache still builds 
> a cache for it.
> IndexOrDocValuesQuery aims to speed up queries when the leadcost is small, 
> but building cache by dvScorers can make it meaningless because it needs to 
> scan all the docvalues. This can be rather slow for big segments, so maybe we 
> should skip the cache building for IndexOrDocValuesQuery when it chooses 
> dvQueries.






[jira] [Commented] (LUCENE-9889) Lucene (unexpected ) fsync on existing segments

2021-04-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314885#comment-17314885
 ] 

Michael McCandless commented on LUCENE-9889:


Thank you for opening this issue [~rahul196...@gmail.com]!

It is indeed weird that Lucene is re-opening segment files that it long ago 
wrote, closed, and fsync'd, just to fsync them again.

There is some fun/exciting history here.  Long ago Lucene's {{IndexWriter}} 
used to keep track of which files were "dirty" (written recently and not yet 
fsync'd), but that was somehow complex and buggy and sometimes sprouted up bad 
memory leaks, and so at one point we moved that tracking from {{IndexWriter}} 
down into {{FSDirectory}}, but then somehow, later, we eventually just removed 
the dirty logic from {{FSDirectory}} and changed to always fsync'ing every 
file.  I agree this is odd and we should perhaps revisit that dirty logic.

Related issues: LUCENE-3237, LUCENE-5570, LUCENE-5588, LUCENE-6150 (this is 
where the dirty file tracking was removed from {{FSDirectory}}).
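
For readers unfamiliar with the "dirty" tracking mentioned above, the idea can 
be sketched roughly like this. It is a hypothetical illustration only, not the 
code that was removed from FSDirectory: the class and method names are made up, 
and a real implementation would mark a file clean only after its fsync 
succeeded.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of dirty-file tracking; not Lucene's removed code.
final class DirtyFileTracker {
  private final Set<String> dirty = new HashSet<>();

  // Called whenever a file is written (or rewritten) and not yet fsync'd.
  synchronized void onWrite(String name) {
    dirty.add(name);
  }

  // Returns only the files that still need an fsync and marks them clean.
  // Simplification: a real implementation would clear the flag after the
  // fsync itself has succeeded.
  synchronized List<String> filesToSync(Collection<String> names) {
    List<String> toSync = new ArrayList<>();
    for (String name : names) {
      if (dirty.remove(name)) {
        toSync.add(name);
      }
    }
    return toSync;
  }
}
```

With such tracking, a commit that references an old, already-synced segment 
file would not need to reopen it at all.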

> Lucene (unexpected ) fsync on existing segments
> ---
>
> Key: LUCENE-9889
> URL: https://issues.apache.org/jira/browse/LUCENE-9889
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.7.2
>Reporter: Rahul Goswami
>Priority: Major
>
>  
> If one of the existing segment files is opened by another (say a 3rd party) 
> process, it can cause a parallel commit to fail with an error complaining 
> that the index files are locked by another process. Upon debugging, I see 
> that fsync is being called during commit on already existing segment files, 
> and failure to open the file in write mode causes this. But this should not 
> be an expected behavior since there is no reason for a commit to open an 
> existing segment file in WRITE mode to fsync. Please note that in this case, 
> the index file was also a part of a saved commit point, so there is all the 
> more reason to not fsync it.    
>  
> The line of code I am referring to is as below:
> try (final FileChannel file = FileChannel.open(fileToSync, isDir ? 
> StandardOpenOption.READ : StandardOpenOption.WRITE))
>  
> in method fsync(Path fileToSync, boolean isDir) of the class file
>  
> lucene\core\src\java\org\apache\lucene\util\IOUtils.java
>  
>  
> Opening this Jira after discussion with Mike Candless and Michael Sokolov on 
> the dev mailing list here:
> [Lucene - Java Developer - Lucene (unexpected ) fsync on existing segments 
> (nabble.com)|https://lucene.472066.n3.nabble.com/Lucene-unexpected-fsync-on-existing-segments-td4469731.html]
>  
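
For context, the quoted line can be placed in a minimal, self-contained method 
like the one below. This is a sketch, not the actual IOUtils source: the class 
name FsyncSketch is hypothetical, and calling force(true) inside the try block 
is an assumption about how the channel is flushed. It shows why a regular file 
has to be opened in WRITE mode for this style of fsync, which is the step that 
fails when another process holds the file.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical wrapper around the line quoted in the description.
final class FsyncSketch {
  static void fsync(Path fileToSync, boolean isDir) throws IOException {
    // Directories are opened read-only; regular files are opened for writing.
    try (FileChannel file =
        FileChannel.open(
            fileToSync, isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE)) {
      // Assumption: flush file content and metadata to the storage device.
      file.force(true);
    }
  }
}
```

Opening a directory as a FileChannel is platform dependent, so only the 
regular-file path is portable in this sketch.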






[jira] [Commented] (LUCENE-9892) Ensure @since tag for public classes

2021-04-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314886#comment-17314886
 ] 

Michael McCandless commented on LUCENE-9892:


Hi [~tomoko], I saw the discussion in LUCENE-9890, that we should fully 
decouple {{@since}} from the existing {{@lucene.experimental}}, but I thought 
your idea here would still be interesting?

I.e. it seems like there ought to be some simple static tooling that can detect 
when an API first appeared and insert {{@since}} tags into the source tree, 
maybe as part of pre-release process?

> Ensure @since tag for public classes
> 
>
> Key: LUCENE-9892
> URL: https://issues.apache.org/jira/browse/LUCENE-9892
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Can we ensure that all public classes' documentation has an @since tag by 
> some precommit task using the doclet API?
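
As a rough idea of what such a check could look like, here is a text-based 
sketch. It does not use the doclet API that the description proposes; it is a 
plain regular-expression scan with hypothetical names, it only sees public 
types that already have a Javadoc comment directly in front of them, and it 
ignores annotations between the comment and the declaration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based check; a real precommit task would use a doclet.
final class SinceTagCheck {
  // Group 1: Javadoc body (never crossing a "*/"); group 2: the type name.
  private static final Pattern DOCUMENTED_PUBLIC_TYPE =
      Pattern.compile(
          "/\\*\\*((?:(?!\\*/).)*)\\*/\\s*public\\s+(?:(?:static|final|abstract)\\s+)*"
              + "(?:class|interface|enum)\\s+(\\w+)",
          Pattern.DOTALL);

  // Returns the names of documented public types whose Javadoc lacks @since.
  static List<String> typesMissingSince(String source) {
    List<String> missing = new ArrayList<>();
    Matcher m = DOCUMENTED_PUBLIC_TYPE.matcher(source);
    while (m.find()) {
      if (!m.group(1).contains("@since")) {
        missing.add(m.group(2));
      }
    }
    return missing;
  }
}
```

A doclet-based task would work on the parsed element tree instead of raw text, 
which avoids the blind spots listed above.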






[jira] [Commented] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery

2021-04-05 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314949#comment-17314949
 ] 

Julie Tibshirani commented on LUCENE-9857:
--

No problem!







[GitHub] [lucene] dweiss merged pull request #63: LUCENE-9901: UnicodeData.java has no regeneration task

2021-04-05 Thread GitBox


dweiss merged pull request #63:
URL: https://github.com/apache/lucene/pull/63


   





[jira] [Commented] (LUCENE-9901) UnicodeData.java has no regeneration task

2021-04-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315014#comment-17315014
 ] 

ASF subversion and git services commented on LUCENE-9901:
-

Commit fbf9191abf2ad4acd26bae16e075cdeb79d33a39 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fbf9191 ]

LUCENE-9901: UnicodeData.java has no regeneration task (#63)



> UnicodeData.java has no regeneration task
> -
>
> Key: LUCENE-9901
> URL: https://issues.apache.org/jira/browse/LUCENE-9901
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/analysis
>Reporter: Uwe Schindler
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When moving build system to gradle, we lost the following groovy script, 
> which is used to regenerate the UnicodeData.java file to be in line with the 
> actually used ICU4J version. The groovy script is still in the repository, 
> but it's no longer used by the build system:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy
> To execute it, we need to convert it to a Gradle task that depends on the 
> same version of ICU4J that we use as the analysis/icu dependency (not sure how 
> to do this, maybe it's easy using palantir).
> The file should also be hashed and put into the regenerated file hashes:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java
> Old Ant task is here:
> https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94






[jira] [Resolved] (LUCENE-9901) UnicodeData.java has no regeneration task

2021-04-05 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9901.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> UnicodeData.java has no regeneration task
> -
>
> Key: LUCENE-9901
> URL: https://issues.apache.org/jira/browse/LUCENE-9901
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/analysis
>Reporter: Uwe Schindler
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When moving build system to gradle, we lost the following groovy script, 
> which is used to regenerate the UnicodeData.java file to be in line with the 
> actually used ICU4J version. The groovy script is still in the repository, 
> but it's no longer used by the build system:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy
> To execute it, we need to convert it to a Gradle task that depends on the 
> same version of ICU4J that we use as the analysis/icu dependency (not sure how 
> to do this, maybe it's easy using palantir).
> The file should also be hashed and put into the regenerated file hashes:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java
> Old Ant task is here:
> https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-05 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315067#comment-17315067
 ] 

Michael Sokolov commented on LUCENE-9855:
-

OK, naming is hard! I think it will help to break down all the classes we are 
(or might be) talking about here. These are the bulk of the classes/packages 
added as part of this vector/knn search effort:

 
{code:java}
o.a.l.codecs: VectorFormat, VectorReader, VectorWriter
o.a.l.codecs.lucene90: Lucene90VectorFormat, Lucene90VectorReader, 
Lucene90VectorWriter
o.a.l.index: VectorValues, VectorValuesWriter, RandomAccessVectorValues, 
RandomAccessVectorValuesProducer
o.a.l.search:
o.a.l.util.hnsw: HnswGraph, HnswGraphBuilder, NeighborQueue, NeighborArray, 
BoundsChecker
{code}
 

I think the scope of this issue is basically – consider a more specific name 
for these vector apis (that isn't so easily confused with TermVectors), and use 
plural form.

Then we got into a discussion of whether this format is hnsw-only, but 
[~julietibs] points out that (a) we already decided it would handle multiple 
ANN algos, and (b) we can have algorithm-specific names in the implementation 
classes (the ones in o.a.l.codecs.lucene90 + any associated utility classes) 
without needing to make that change anywhere else (at the interface level).

[~rcmuir] also raised some other issues: one performance-related, another about 
whether we should have this strategy pattern at all. I might have missed 
something else? I think those are separate issues though: Robert, please feel 
free to open another JIRA if you think we ought to pursue them further.

Given that, I think we are talking here about the names of:

 
{code:java}
o.a.l.codecs: VectorFormat, VectorReader, VectorWriter
o.a.l.index: VectorValues, VectorValuesWriter, RandomAccessVectorValues, 
RandomAccessVectorValuesProducer
{code}
We seem to be evolving some consensus around {{NumericVectors}}. I think if we 
are going to have a plural root like that, it makes no sense to add {{Values}} 
after it (NumericVectorsValues?), and the "values" name was really just copied 
from DocValues; I don't think it adds anything. I'd like to just change 
"VectorValues" to "NumericVectors" and "Vector" to "NumericVectors", but this 
leaves two {{NumericVectorsWriter}} classes in different packages. Maybe we 
could adopt the DocValues Producer/Consumer naming in the codecs package, with 
this result:
{code:java}
o.a.l.codecs: NumericVectorsFormat, NumericVectorsProducer, 
NumericVectorsConsumer
o.a.l.index: NumericVectors, NumericVectorsWriter, 
RandomAccessNumericVectors, RandomAccessNumericVectorsSupplier
{code}
 







[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-05 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315066#comment-17315066
 ] 

Greg Miller commented on LUCENE-9850:
-

Thanks [~mikemccand]! For starters, yes—all the runs I referenced here are 
using "wikimediumall".
{quote}I would expect {{XTerm}} to show speedups since this is largely 
dominated by decoding many postings blocks.  But it is odd to see the 
{{XTermYSort}} tasks negatively impacted: those tasks are just sorting by a 
{{DocValues}} field instead of default text relevance (BM25).
{quote}
I might expect the opposite, actually. Anytime the PFOR approach has to apply 
exceptions, I would expect a performance hit of some sort since it has extra 
work to do on top of the FOR approach used today. So if the Term tasks are 
largely dominated by postings decoding, I would expect regressions to show up 
there more than elsewhere. Maybe I'm misunderstanding your comment though?

I re-ran wikimediumall with only "Term" tasks and got the following (looks like 
noise to me):
{code:java}
Task                   QPS baseline  StdDev  QPS pfordocids  StdDev  Pct diff                p-value
HighTermDayOfYearSort          5.74 (10.8%)            5.59  (9.9%)   -2.6% ( -20% -   20%)    0.431
TermDTSort                    44.54 (15.4%)           44.06 (14.3%)   -1.1% ( -26% -   33%)    0.816
HighTermTitleBDVSort          30.90 (14.3%)           30.59 (13.5%)   -1.0% ( -25% -   31%)    0.820
MedTerm                      392.89  (6.8%)          389.45  (7.5%)   -0.9% ( -14% -   14%)    0.699
LowTerm                      412.80  (7.0%)          410.68  (7.9%)   -0.5% ( -14% -   15%)    0.827
PKLookup                     130.70  (2.8%)          131.44  (2.0%)    0.6% (  -4% -    5%)    0.470
HighTermMonthSort             61.69 (12.2%)           62.13 (13.6%)    0.7% ( -22% -   30%)    0.860
HighTerm                     381.73 (10.0%)          385.03  (7.8%)    0.9% ( -15% -   20%)    0.761
{code}
I also pulled out all the tasks in my last wikimediumall run that had a 
significant change in either direction (p-value <= 0.05) and reran them alone 
to see if the results were repeatable. In general, they were not. The three 
that *were* repeatable (significant) regressions were LowSpanNear (-2.2%), 
AndHighMed (-2.1%) and AndHighHigh (-2.0%):
{code:java}
Task                   QPS baseline  StdDev  QPS pfordocids  StdDev  Pct diff                p-value
HighTermTitleBDVSort          42.09 (10.6%)           41.04 (10.4%)   -2.5% ( -21% -   20%)    0.451
LowSpanNear                    4.29  (1.9%)            4.20  (1.4%)   -2.2% (  -5% -    1%)    0.000
AndHighMed                    26.79  (3.0%)           26.23  (2.4%)   -2.1% (  -7% -    3%)    0.014
AndHighHigh                   13.83  (3.4%)           13.54  (2.8%)   -2.0% (  -7% -    4%)    0.037
HighTerm                     696.28  (6.7%)          688.02  (7.0%)   -1.2% ( -13% -   13%)    0.585
OrNotHighHigh                372.40  (5.3%)          370.23  (5.3%)   -0.6% ( -10% -   10%)    0.726
HighTermDayOfYearSort         13.53  (9.8%)           13.46 (10.4%)   -0.5% ( -18% -   21%)    0.866
MedSpanNear                   19.76  (1.7%)           19.67  (1.4%)   -0.5% (  -3% -    2%)    0.320
HighTermMonthSort             48.94 (11.2%)           48.77 (10.5%)   -0.3% ( -19% -   24%)    0.921
LowTerm                      714.77  (3.0%)          714.27  (5.6%)   -0.1% (  -8% -    8%)    0.960
PKLookup                     139.66  (3.0%)          139.69  (2.4%)    0.0% (  -5% -    5%)    0.984
IntNRQ                        33.49  (1.4%)           33.54  (1.2%)    0.2% (  -2% -    2%)    0.690
{code}
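
For readers who want to redo the significance cut on the rerun table above, here is a minimal sketch. The class and method names ({{SignificanceFilter}}, {{rerunRows}}, {{significant}}) are made up for illustration and are not part of luceneutil; the task names, percentage differences and p-values are copied from the table, and the 0.05 threshold is the one stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (hypothetical, not part of luceneutil). */
final class SignificanceFilter {

  /** One benchmark row: task name, percentage difference, and p-value. */
  static final class Row {
    final String task;
    final double pctDiff;
    final double pValue;

    Row(String task, double pctDiff, double pValue) {
      this.task = task;
      this.pctDiff = pctDiff;
      this.pValue = pValue;
    }
  }

  /** Rows copied from the rerun table in this comment. */
  static List<Row> rerunRows() {
    return Arrays.asList(
        new Row("HighTermTitleBDVSort", -2.5, 0.451),
        new Row("LowSpanNear", -2.2, 0.000),
        new Row("AndHighMed", -2.1, 0.014),
        new Row("AndHighHigh", -2.0, 0.037),
        new Row("HighTerm", -1.2, 0.585),
        new Row("OrNotHighHigh", -0.6, 0.726),
        new Row("HighTermDayOfYearSort", -0.5, 0.866),
        new Row("MedSpanNear", -0.5, 0.320),
        new Row("HighTermMonthSort", -0.3, 0.921),
        new Row("LowTerm", -0.1, 0.960),
        new Row("PKLookup", 0.0, 0.984),
        new Row("IntNRQ", 0.2, 0.690));
  }

  /** Returns the tasks whose p-value is at or below the threshold, in table order. */
  static List<String> significant(List<Row> rows, double alpha) {
    List<String> tasks = new ArrayList<>();
    for (Row row : rows) {
      if (row.pValue <= alpha) {
        tasks.add(row.task);
      }
    }
    return tasks;
  }

  public static void main(String[] args) {
    System.out.println(significant(rerunRows(), 0.05));
  }
}
```

With the 0.05 cut only LowSpanNear, AndHighMed and AndHighHigh remain, which are the three regressions named above.
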
Finally, I tried running a micro-benchmark to see if I could isolate how much 
regression there was due to applying the exceptions in PFOR compared to FOR. I 
forked [~jpountz]'s code originally used for optimizing FOR. The results are in 
the README over 
[here|https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-delta] if 
interested. They generally make sense to me and show that performance does take 
a hit when there are exceptions to patch in, but I think the overall luceneutil 
benchmarks show that those performance shifts mostly don't show up in the 
bigger picture.
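
To make the "extra work on top of FOR" concrete, here is a minimal sketch of the two decode paths. It is a deliberate simplification and not Lucene's ForUtil/PForUtil: the low bits are kept in a plain int array instead of being bit-packed, values are assumed to be non-negative deltas, and the names ({{PforSketch}}, {{Block}}, {{sizeInBits}}) and the two-bytes-per-exception cost model are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified FOR vs. PFOR illustration; not Lucene's actual implementation. */
final class PforSketch {

  /** An encoded block: truncated values plus the patches needed to restore them. */
  static final class Block {
    final int bitsPerValue;
    final int[] lowBits;        // every value, truncated to bitsPerValue bits
    final int[] exceptionIndex; // positions whose value did not fit
    final int[] exceptionHigh;  // the bits that were cut off at those positions

    Block(int bitsPerValue, int[] lowBits, int[] exceptionIndex, int[] exceptionHigh) {
      this.bitsPerValue = bitsPerValue;
      this.lowBits = lowBits;
      this.exceptionIndex = exceptionIndex;
      this.exceptionHigh = exceptionHigh;
    }

    /** Toy cost model (assumption): packed low bits plus two bytes per exception. */
    long sizeInBits() {
      return (long) lowBits.length * bitsPerValue + 16L * exceptionIndex.length;
    }
  }

  /** Smallest width (1..31) that can hold a non-negative value. */
  static int bitsRequired(int value) {
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(value));
  }

  /** FOR: the width covers the largest value, so there is nothing to patch. */
  static Block encodeFor(int[] values) {
    int max = 0;
    for (int v : values) {
      max = Math.max(max, v);
    }
    return encode(values, bitsRequired(max));
  }

  /** PFOR: a narrower width (1..31); values that overflow it become exceptions. */
  static Block encode(int[] values, int bitsPerValue) {
    int mask = (1 << bitsPerValue) - 1;
    int[] low = new int[values.length];
    List<Integer> index = new ArrayList<>();
    List<Integer> high = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      low[i] = values[i] & mask;
      int overflow = values[i] >>> bitsPerValue;
      if (overflow != 0) {
        index.add(i);
        high.add(overflow);
      }
    }
    return new Block(bitsPerValue, low, toArray(index), toArray(high));
  }

  /** Decode: the copy is shared with FOR; the loop is the PFOR-only patching step. */
  static int[] decode(Block block) {
    int[] out = block.lowBits.clone();
    for (int i = 0; i < block.exceptionIndex.length; i++) {
      out[block.exceptionIndex[i]] |= block.exceptionHigh[i] << block.bitsPerValue;
    }
    return out;
  }

  private static int[] toArray(List<Integer> list) {
    int[] out = new int[list.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = list.get(i);
    }
    return out;
  }
}
```

For the eight deltas {{3, 1, 2, 300, 1, 2, 3, 1}}, {{encodeFor}} needs 9 bits per value (72 bits in this toy model), while {{encode(values, 2)}} stores 2 bits per value plus one exception (32 bits) and pays for it with one patch during decode. That is the size-versus-decode-cost trade-off being measured here.
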

So overall, it seems like the index size reduction might be worth it here, but 
I'm new to Lucene benchmarks and this is my first attempt at running a 
micro-benchmark, so I would trust the opinion of others with more experienced 
eyes here.

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>

[GitHub] [lucene] gsmiller commented on a change in pull request #62: LUCENE-9902 Minor fixes to the faceting API

2021-04-05 Thread GitBox


gsmiller commented on a change in pull request #62:
URL: https://github.com/apache/lucene/pull/62#discussion_r607302178



##
File path: lucene/CHANGES.txt
##
@@ -287,7 +287,10 @@ Other
 
 API Changes
 -
-(No changes)
+
+* LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be 
protected instead of private.

Review comment:
   Looks great! Have you opened up a corresponding PR against the 8x branch 
in the old repository as well for this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [lucene] gautamworah96 commented on a change in pull request #62: LUCENE-9902 Minor fixes to the faceting API

2021-04-05 Thread GitBox


gautamworah96 commented on a change in pull request #62:
URL: https://github.com/apache/lucene/pull/62#discussion_r607372040



##
File path: lucene/CHANGES.txt
##
@@ -287,7 +287,10 @@ Other
 
 API Changes
 -
-(No changes)
+
+* LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be 
protected instead of private.

Review comment:
   I think the next branch for Lucene 8.9 will be cut from this repo?







[GitHub] [lucene] gsmiller commented on a change in pull request #62: LUCENE-9902 Minor fixes to the faceting API

2021-04-05 Thread GitBox


gsmiller commented on a change in pull request #62:
URL: https://github.com/apache/lucene/pull/62#discussion_r607379305



##
File path: lucene/CHANGES.txt
##
@@ -287,7 +287,10 @@ Other
 
 API Changes
 -
-(No changes)
+
+* LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be 
protected instead of private.

Review comment:
   @gautamworah96 If I understand correctly, you need to actually submit a 
separate PR to this branch: https://github.com/apache/lucene-solr/tree/branch_8x







[GitHub] [lucene] jtibshirani commented on a change in pull request #11: LUCENE-9334 Consistency of field data structures

2021-04-05 Thread GitBox


jtibshirani commented on a change in pull request #11:
URL: https://github.com/apache/lucene/pull/11#discussion_r607396127



##
File path: lucene/MIGRATE.md
##
@@ -358,11 +358,21 @@ Rather, an IllegalArgumentException shall be thrown. This 
is introduced for bett
 defence and to ensure that there is no bubbling up of errors when Lucene is
 used in multi level applications
 
-## Assumption of data consistency between different data-structures sharing 
the same field name
+### Require consistency between data-structures on a per-field basis
 
-Sorting on a numeric field that is indexed with both doc values and points may 
use an

Review comment:
   I don't think this change enforces that the same values were passed to 
each field. So maybe we could keep this note about data consistency, as it's 
not fully covered under the new one "Require consistency between data 
structures..." I'm also curious if we plan to enforce value consistency in a 
follow-up? (Sorry if I'm missing something, I'm not fully up-to-date on the PR).







[jira] [Created] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-04-05 Thread Julie Tibshirani (Jira)
Julie Tibshirani created LUCENE-9905:


 Summary: Revise approach to specifying NN algorithm
 Key: LUCENE-9905
 URL: https://issues.apache.org/jira/browse/LUCENE-9905
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Julie Tibshirani


In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
particular nearest-neighbor search data structure and algorithm. This 
flexibility is important since NN search is a developing area and we'd like to 
be able to experiment and evolve the algorithm. Right now we only have one 
algorithm (HNSW), but we want to maintain the ability to use another.

Currently the algorithm to use is specified through {{SearchStrategy}}, for 
example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is 
expected to handle multiple algorithms. Instead we could have one format 
implementation per algorithm. Our current implementation would be HNSW-specific 
like {{HnswVectorFormat}}, and to experiment with another algorithm you could 
create a new implementation like {{ClusterVectorFormat}}. This would be better 
aligned with the codec framework, and help avoid exposing algorithm details in 
the API.

A concrete proposal (note many of these names will change when LUCENE-9855 is 
addressed):
 1. Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
 2. Remove references to HNSW in {{SearchStrategy}}, so there is just 
{{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like 
{{SimilarityFunction}}.
 3. Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
 4. Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
parameters to be configured per-field (?)
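
The four steps above could look roughly like the following sketch. Everything in it is hypothetical: the type names carry a {{Sketch}} suffix to make clear they are not existing or agreed Lucene classes, {{DOT_PRODUCT}} is included only as an example of the "etc.", and the parameter values used with it are arbitrary. The point is only the shape: the similarity function no longer names an algorithm, HNSW parameters move from {{FieldType}} into the format, and a per-field wrapper picks the format.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a renamed SearchStrategy: a distance measure, no algorithm. */
enum SimilarityFunctionSketch {
  EUCLIDEAN {
    @Override
    float compare(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float d = a[i] - b[i];
        sum += d * d;
      }
      return sum; // squared distance; smaller means closer
    }
  },
  DOT_PRODUCT {
    @Override
    float compare(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
      }
      return sum; // larger means more similar
    }
  };

  abstract float compare(float[] a, float[] b);
}

/** Hypothetical codec-level API: nothing here mentions a specific NN algorithm. */
interface VectorFormatSketch {
  String name();
}

/** One implementation per algorithm; HNSW tuning lives here instead of on FieldType. */
final class HnswVectorFormatSketch implements VectorFormatSketch {
  final int maxConn;
  final int beamWidth;

  HnswVectorFormatSketch(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  public String name() {
    return "HnswVectorFormatSketch(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }
}

/** Per-field dispatch: each field may use a different format or different parameters. */
final class PerFieldVectorFormatSketch {
  private final Map<String, VectorFormatSketch> byField = new HashMap<>();
  private final VectorFormatSketch fallback;

  PerFieldVectorFormatSketch(VectorFormatSketch fallback) {
    this.fallback = fallback;
  }

  PerFieldVectorFormatSketch register(String field, VectorFormatSketch format) {
    byField.put(field, format);
    return this;
  }

  VectorFormatSketch formatFor(String field) {
    return byField.getOrDefault(field, fallback);
  }
}
```

Under this shape, trying a second algorithm would mean adding another {{VectorFormatSketch}} implementation and registering it for a field, without touching the similarity enum or the field type.
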

One note: the current HNSW-based format includes logic for storing a numeric 
vector per document, as well as constructing + storing an HNSW graph. When 
adding another implementation, it’d be nice to be able to reuse the logic for 
reading/writing numeric vectors. I don’t think we need to design for this 
right now, but we can keep it in mind for the future?

This issue is based on a thread [~jpountz] started: 
[https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[jira] [Updated] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-04-05 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9905:
-
Description: 
In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
particular nearest-neighbor search data structure and algorithm. This 
flexibility is important since NN search is a developing area and we'd like to 
be able to experiment and evolve the algorithm. Right now we only have one 
algorithm (HNSW), but we want to maintain the ability to use another.

Currently the algorithm to use is specified through {{SearchStrategy}}, for 
example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is 
expected to handle multiple algorithms. Instead we could have one format 
implementation per algorithm. Our current implementation would be HNSW-specific 
like {{HnswVectorFormat}}, and to experiment with another algorithm you could 
create a new implementation like {{ClusterVectorFormat}}. This would be better 
aligned with the codec framework, and help avoid exposing algorithm details in 
the API.

A concrete proposal (note many of these names will change when LUCENE-9855 is 
addressed):
# Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
# Remove references to HNSW in {{SearchStrategy}}, so there is just 
{{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like 
{{SimilarityFunction}}.
# Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
# Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
parameters to be configured per-field \(?\)

One note: the current HNSW-based format includes logic for storing a numeric 
vector per document, as well as constructing + storing an HNSW graph. When 
adding another implementation, it’d be nice to be able to reuse the logic for 
reading/writing numeric vectors. I don’t think we need to design for this 
right now, but we can keep it in mind for the future?

This issue is based on a thread [~jpountz] started: 
[https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> par

[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-05 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315184#comment-17315184
 ] 

Julie Tibshirani commented on LUCENE-9855:
--

Quick note: I spun off LUCENE-9905 to discuss the approach for specifying the 
NN algorithm (question #2 in my comment above). Maybe we could hold off on a PR 
for LUCENE-9905 until we settle on names here, to avoid unnecessary conflicts.







[jira] [Updated] (LUCENE-9906) TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error

2021-04-05 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9906:
-
Issue Type: Test  (was: Improvement)

> TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error
> 
>
> Key: LUCENE-9906
> URL: https://issues.apache.org/jira/browse/LUCENE-9906
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Major
>
> This reproduced for me on {branch_8x}, but I couldn't reproduce on {main} 
> using the same seed.
> {code}
> ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testAddIndexesWithDeletionsAndDirectory 
> -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   0.83s | 
> TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<<
>    [junit4]    > Throwable #1: java.lang.RuntimeException: index sort changed 
> from , to 
>    [junit4]    >  at 
> __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0)
>    [junit4]    >  at 
> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650)
>    [junit4]    >  at 
> org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
>    [junit4]    >  at 
> org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867)
>    [junit4]    >  at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
>    [junit4]    >  at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
>    [junit4]    >  at 
> org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125)
>    [junit4]    >  at 
> org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141)
>    [junit4]    >  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    [junit4]    >  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
>    [junit4]    >  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    [junit4]    >  at 
> java.base/java.lang.reflect.Method.invoke(Method.java:564)
>    [junit4]    >  at java.base/java.lang.Thread.run(Thread.java:832)
>    [junit4]   2> NOTE: test params are: codec=Asserting(Lucene87): 
> {id=FST50}, docValues:{bar=DocValuesFormat(name=Lucene80), 
> foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, 
> maxMBSortInHeap=5.567472541013199, 
> sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), 
> locale=tk-TM, timezone=Asia/Bishkek {code}






[jira] [Created] (LUCENE-9906) TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error

2021-04-05 Thread Julie Tibshirani (Jira)
Julie Tibshirani created LUCENE-9906:


 Summary: TestIndexSorting.testAddIndexesWithDeletionsAndDirectory 
can throw error
 Key: LUCENE-9906
 URL: https://issues.apache.org/jira/browse/LUCENE-9906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Julie Tibshirani


This reproduced for me on {branch_8x}, but I couldn't reproduce on {main} using 
the same seed.

{code}
ant test  -Dtestcase=TestIndexSorting -Dtests.method=testAddIndexesWithDeletionsAndDirectory -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.83s | TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: index sort changed from , to 
   [junit4]    >  at __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0)
   [junit4]    >  at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650)
   [junit4]    >  at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
   [junit4]    >  at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867)
   [junit4]    >  at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
   [junit4]    >  at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
   [junit4]    >  at org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125)
   [junit4]    >  at org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141)
   [junit4]    >  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
   [junit4]    >  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >  at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   [junit4]    >  at java.base/java.lang.Thread.run(Thread.java:832)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene87): {id=FST50}, docValues:{bar=DocValuesFormat(name=Lucene80), foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, maxMBSortInHeap=5.567472541013199, sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), locale=tk-TM, timezone=Asia/Bishkek
{code}






[jira] [Updated] (LUCENE-9906) TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error

2021-04-05 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9906:
-
Description: 
This reproduced for me on {{branch_8x}} , but I couldn't reproduce on {{main}} 
using the same seed.

{code}
ant test  -Dtestcase=TestIndexSorting -Dtests.method=testAddIndexesWithDeletionsAndDirectory -Dtests.seed=BED8ADB81856E61F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=tk-TM -Dtests.timezone=Asia/Bishkek -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.83s | TestIndexSorting.testAddIndexesWithDeletionsAndDirectory <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: index sort changed from , to 
   [junit4]    >  at __randomizedtesting.SeedInfo.seed([BED8ADB81856E61F:C4679CC0D9440770]:0)
   [junit4]    >  at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:650)
   [junit4]    >  at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
   [junit4]    >  at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:867)
   [junit4]    >  at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
   [junit4]    >  at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
   [junit4]    >  at org.apache.lucene.index.TestIndexSorting.testAddIndexes(TestIndexSorting.java:2125)
   [junit4]    >  at org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletionsAndDirectory(TestIndexSorting.java:2141)
   [junit4]    >  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
   [junit4]    >  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >  at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   [junit4]    >  at java.base/java.lang.Thread.run(Thread.java:832)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene87): {id=FST50}, docValues:{bar=DocValuesFormat(name=Lucene80), foo=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=1970, maxMBSortInHeap=5.567472541013199, sim=Asserting(RandomSimilarity(queryNorm=false): {id=DFR GB3(800.0)}), locale=tk-TM, timezone=Asia/Bishkek
{code}


> TestIndexSorting.testAddIndexesWithDeletionsAndDirectory can throw error
> 
>
> Key: LUCENE-9906
> URL: https://issues.apache.org/jira/browse/LUCENE-9906
> Project: Lucene - Core

[jira] [Updated] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-05 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9855:
--
Fix Version/s: main (9.0)

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Blocker
> Fix For: main (9.0)
>
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat


