[jira] [Updated] (SOLR-12490) Introducing json.queries WAS:Query DSL supports for further referring and exclusion in JSON facets

2020-01-11 Thread Mikhail Khludnev (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-12490:

Attachment: SOLR-12490-ref-guide.patch

> Introducing json.queries WAS:Query DSL supports for further referring and 
> exclusion in JSON facets 
> ---
>
> Key: SOLR-12490
> URL: https://issues.apache.org/jira/browse/SOLR-12490
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module, faceting
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
>  Labels: newdev
> Fix For: 8.5
>
> Attachments: SOLR-12490-ref-guide.patch, SOLR-12490-ref-guide.patch, 
> SOLR-12490.patch, SOLR-12490.patch, SOLR-12490.patch, SOLR-12490.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It's a spin-off from the 
> [discussion|https://issues.apache.org/jira/browse/SOLR-9685?focusedCommentId=16508720&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16508720].
>  
> h2. Problem
> # after SOLR-9685 we can tag separate clauses in hairy queries like 
> {{parent}}, {{bool}}
> # we can exclude those tagged clauses via {{domain.excludeTags}}
> # we are looking for child faceting with exclusions, see SOLR-9510, SOLR-8998
> # but we can only refer to whole params in {{domain.filter}}; it's not 
> possible to refer to separate clauses
> see the first comment, and the sketch below
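> A rough sketch of the intended request shape, assuming the {{queries}} key and 
> the {{param}} reference syntax from the attached patches (collection, fields, 
> and tags are illustrative, not final):
> {code}
> curl http://localhost:8983/solr/techproducts/query -d '
> {
>   "queries": {
>     "child_query": { "bool": { "must": [
>       { "#COLOR": "color:blue" },
>       { "#SIZE":  "size:XL" }
>     ]}}
>   },
>   "query": { "parent": { "which": "type:parent", "query": { "param": "child_query" }}},
>   "facet": {
>     "sizes": {
>       "type": "terms", "field": "size",
>       "domain": { "excludeTags": "SIZE", "filter": { "param": "child_query" }}
>     }
>   }
> }'
> {code}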






[jira] [Commented] (SOLR-12490) Introducing json.queries WAS:Query DSL supports for further referring and exclusion in JSON facets

2020-01-11 Thread Mikhail Khludnev (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013404#comment-17013404
 ] 

Mikhail Khludnev commented on SOLR-12490:
-

Attaching a fixed Ref Guide patch [^SOLR-12490-ref-guide.patch]. It also fixes a 
few broken refs to the JSON Facet API page.
[~ctargett], would you like to review it before I push? 







[GitHub] [lucene-solr] dweiss commented on issue #1157: Add RAT check using Gradle

2020-01-11 Thread GitBox
dweiss commented on issue #1157: Add RAT check using Gradle
URL: https://github.com/apache/lucene-solr/pull/1157#issuecomment-573306410
 
 
   I'll take a look later, Mike. As for applying tasks and anything else -- 
think of the project structure as a graph. You attach things to this graph in 
two passes (evaluation, configuration), followed by execution of the tasks 
attached to this graph (in topological order of their dependencies).
   
   It is conceptually simple. The devil hides in the details of how Gradle 
scripts are evaluated, lazily evaluated collections, etc. This should be helpful:
   
   https://docs.gradle.org/current/userguide/build_lifecycle.html
   
   I'll review the patch and maybe correct it before committing; when you 
compare the commit against your patch you'll see the differences I made - I 
think that'll be easier and faster than explaining (but go ahead and ask if you 
don't understand something).
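   
   For illustration, a minimal toy build script showing the two phases -- the 
configuration block runs while Gradle evaluates the build and wires up the task 
graph, the doLast action runs only if/when the task is executed (this is a 
sketch, not code from the PR):
{code}
// build.gradle (toy example, not from the PR)
task rat {
    // Configuration phase: runs during evaluation for every task in the
    // graph, even if 'rat' itself is never executed.
    ext.reportFile = file("$buildDir/rat-report.txt")

    doLast {
        // Execution phase: runs only when 'rat' is scheduled, after its
        // dependencies, in topological order.
        println "RAT report would be written to ${reportFile}"
    }
}
{code}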
   
   





[jira] [Updated] (SOLR-12490) Introducing json.queries was:Query DSL supports for further referring and exclusion in JSON facets

2020-01-11 Thread Mikhail Khludnev (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-12490:

Summary: Introducing json.queries was:Query DSL supports for further 
referring and exclusion in JSON facets   (was: Introducing json.queries 
WAS:Query DSL supports for further referring and exclusion in JSON facets )







[jira] [Commented] (SOLR-13934) Documentation on SimplePostTool for Windows users is pretty brief

2020-01-11 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013497#comment-17013497
 ] 

David Eric Pugh commented on SOLR-13934:


The editorial changes you made look great! Changing up the code should 
probably be a new JIRA.

> Documentation on SimplePostTool for Windows users is pretty brief
> -
>
> Key: SOLR-13934
> URL: https://issues.apache.org/jira/browse/SOLR-13934
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SimplePostTool
>Affects Versions: 8.3
>Reporter: David Eric Pugh
>Assignee: Jason Gerlowski
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimplePostTool on Windows doesn't have enough documentation; you end up 
> googling to get it to work. We need to provide a better example, e.g. along 
> the lines sketched below.
> https://lucene.apache.org/solr/guide/8_3/post-tool.html#simpleposttool
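> For instance (paths and collection name are illustrative; on Windows there is 
> no {{bin/post}} wrapper, so the tool's jar is invoked directly):
> {code}
> REM from the Solr install directory, index the example docs into "techproducts"
> cd C:\solr-8.3.0
> java -Dc=techproducts -jar example\exampledocs\post.jar example\exampledocs\*.xml
> 
> REM index a single JSON file, letting the tool guess the content type
> java -Dc=techproducts -Dauto=yes -jar example\exampledocs\post.jar my_docs\books.json
> {code}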






[jira] [Commented] (LUCENE-9126) Javadoc linting options silently swallow documentation errors

2020-01-11 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013505#comment-17013505
 ] 

Dawid Weiss commented on LUCENE-9126:
-

Jon filed a bug for us.
https://bugs.openjdk.java.net/browse/JDK-8236949

> Javadoc linting options silently swallow documentation errors
> -
>
> Key: LUCENE-9126
> URL: https://issues.apache.org/jira/browse/LUCENE-9126
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>
> I tried to compile javadocs with Gradle and I couldn't do it... The output was 
> full of errors.
> I eventually narrowed the problem down to lint options – how they are 
> interpreted and parsed just doesn't make any sense to me. Try this:
> {code}
> # Examples below use plain javadoc from Java 11.
> cd lucene/core
> {code}
> This emulates what we have in Ant (these are roughly the options Ant emits):
> {code}
> javadoc -d build\output -encoding "UTF-8" -sourcepath src\java -subpackages 
> org -quiet -Xdoclint:all -Xdoclint:-missing -Xdoclint:-accessibility
> => no errors.
> {code}
> Now rerun it with this syntax:
> {code}
> javadoc -d build\output -encoding "UTF-8" -sourcepath src\java -subpackages 
> org -quiet -Xdoclint:all,-missing,-accessibility
> => 100 errors, 5 warnings
> {code}
> This time javadoc displays errors about undefined tags (unknown tag: 
> lucene.experimental), HTML warnings (warning: empty tag), etc.
> Let's add our custom tags and an overview file:
> {code}
> javadoc -overview "src/java/overview.html" -tag "lucene.experimental:a:xxx" 
> -tag "lucene.internal:a:xxx" -tag "lucene.spi:t:xxx" -d build\output 
> -encoding "UTF-8" -sourcepath src\java -subpackages org -quiet 
> -Xdoclint:all,-missing,-accessibility
> => 100 errors, 5 warnings
> => still HTML warnings
> {code}
> Let's get rid of HTML linting:
> {code}
> javadoc -overview "src/java/overview.html" -tag "lucene.experimental:a:xxx" 
> -tag "lucene.internal:a:xxx" -tag "lucene.spi:t:xxx" -d build\output 
> -encoding "UTF-8" -sourcepath src\java -subpackages org -quiet 
> -Xdoclint:all,-missing,-accessibility,-html
> => 3 errors
> => malformed HTML syntax in overview.html: src\java\overview.html:150: error: 
> bad use of '>' (>)
> {code}
> Finally, let's get rid of syntax linting:
> {code}
> javadoc -overview "src/java/overview.html" -tag "lucene.experimental:a:xxx" 
> -tag "lucene.internal:a:xxx" -tag "lucene.spi:t:xxx" -d build\output 
> -encoding "UTF-8" -sourcepath src\java -subpackages org -quiet 
> -Xdoclint:all,-missing,-accessibility,-html,-syntax
> => passes
> {code}
> There are definitely bugs in our documentation -- look at the extra ">" in 
> the overview file, for example:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/overview.html#L150
> What I can't understand is why the first syntax suppresses pretty much ALL 
> the errors, including missing custom tag definitions. The two forms should 
> behave the same, given what's written in [1].
> [1] https://docs.oracle.com/en/java/javase/11/tools/javadoc.html






[jira] [Assigned] (LUCENE-9126) Javadoc linting options silently swallow documentation errors

2020-01-11 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-9126:
---

Assignee: Dawid Weiss







[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)

2020-01-11 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013597#comment-17013597
 ] 

Chris M. Hostetter commented on SOLR-13486:
---

I've been revisiting this aspect of my earlier investigation into this bug...
{quote}{color:#de350b}*Why does the leader _need_ to do tlog replay in the test 
at all?*{color}

Even if the client doesn't explicitly commit all docs, the "Commit on Close" 
semantics of Solr's IndexWriter should ensure that a clean shutdown of the 
leader means all uncommitted docs in the tlog will be automatically committed 
before the Directory is closed – nothing in the test "kills" the leader before 
this should happen.

So WTF?

I still haven't gotten to the bottom of that, but I did confirm that:
 * unlike the "normal" adds for docs 1-3, the code path in TestCloudConsistency 
that was adding doc #4 (during the network partition) was *NOT* committing 
doc #4.
 * in the test logs where TestCloudConsistency failed, we never see the normal 
"Committing on IndexWriter close." message I would expect from an orderly 
shutdown of the leader
 ** This message does appear in the expected location of the logs for a 
TestCloudConsistency run that passes

At first I thought the problem was some other test class running earlier in the 
same Jenkins JVM mucking with the value of the (public static) 
{{DirectUpdateHandler2.commitOnClose}} prior to the test running – but even 
when running a single test class locally, with 
{{DirectUpdateHandler2.commitOnClose = true;}} I was able to continue to 
reproduce the problem in my new test.
{quote}
I've been trying to get to the bottom of this by modifying 
{{TestTlogReplayVsRecovery}} to explicitly use 
{{DirectUpdateHandler2.commitOnClose = true;}} (as mentioned above) along with 
more detailed logging from org.apache.solr.update (particularly DUH2).

The first thing I realized is that there's a bug in the test: it expects to 
find {{uncommittedDocs + uncommittedDocs}} docs instead of 
{{committedDocs + uncommittedDocs}}, which is why it failed so easily/quickly 
for me before.

With that trivial test bug fixed, I have *NOT* been able to reproduce the 
situation that was observed in {{TestCloudConsistency}} when this jira was 
filed: that the leader shut down (evidently) w/o doing a commitOnClose, 
necessitating tlog replay on startup, which then happens after a replica has 
already done recovery.

The only way I can seem to trigger this situation is with 
{{DirectUpdateHandler2.commitOnClose = false;}} (ie: simulating an unclean 
shutdown), suggesting that maybe my original guess about some other test in the 
same JVM borking this setting was correct ... but I still haven't been able to 
find a test that ran in the same JVM which might be broken in that way.

The only failure type I've been able to trigger is a new one AFAICT:
 * (partitioned) leader successfully indexes some docs & commits on shutdown
 * leader re-starts, and sends {{REQUESTRECOVERY}} to replica
 * leader marks itself as active
 * test thread detects "all replicas are active" *before* replica has a chance 
to actually go into recovery
 * test thread checks replica for docs that only leader has, and fails

...ironically I've only been able to reproduce this using 
{{TestTlogReplayVsRecovery}} – I've never seen it in {{TestCloudConsistency}} 
even though it seems like that test establishes the same preconditions? 
(Successful logs of {{TestCloudConsistency}} never show a {{REQUESTRECOVERY}} 
command sent to the replicas from the leader, like I see in (both success and 
failure) logs for {{TestTlogReplayVsRecovery}}, so I'm guessing it has to do 
with how many docs are out of sync and what type of recovery is done? ... not 
certain)

My next steps are:
 * Commit a fix for the {{uncommittedDocs + uncommittedDocs}} bug in 
{{TestTlogReplayVsRecovery}}
 ** This will also include some TODOs about making the test more robust with 
more randomized committed & uncommitted docs before/after the network partition
 *** These TODOs aren't really worth pursuing until the underlying bug is fixed
 * Open new jiras for:
 ** Replacing {{DirectUpdateHandler2.commitOnClose}} with something in 
{{TestInjection}} (per comment there)
 *** so we can be more confident tests aren't leaving it in a bad state
 ** Consider setting the replica to {{State.RECOVERING}} synchronously when 
processing the {{REQUESTRECOVERY}} command.
 *** w/o this, even if we fix the bug tracked in this issue, it's still 
impossible for tests like {{TestTlogReplayVsRecovery}} – or end users – to set 
CollectionState watchers to know when a collection is healthy in situations 
like the one being tracked in this jira.

After that, I don't think there's anything else to do until someone smarter 
than me can chime in about fixing the underlying race condition of (leader) 
"tlog replay on startup" vs (replica) "recover from leader".


[jira] [Created] (SOLR-14183) replicas do not immediately/synchronously reflect state=RECOVERING when receiving REQUESTRECOVERY commands

2020-01-11 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created SOLR-14183:
-

 Summary: replicas do not immediately/synchronously reflect 
state=RECOVERING when receiving REQUESTRECOVERY commands
 Key: SOLR-14183
 URL: https://issues.apache.org/jira/browse/SOLR-14183
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Chris M. Hostetter


Spun off of SOLR-13486: Consider the following situation, which can occur in 
{{TestTlogReplayVsRecovery}}:
 * healthy cluster, healthy shard with multiple replicas
 * network partition occurs, leader adds new documents
 * network partition is healed, leader is restarted
 * leader determines it should be leader again
 ** sends {{REQUESTRECOVERY}} to replicas
 ** leader marks itself as {{state=ACTIVE}}
 * client checks cluster status and sees all replicas are {{ACTIVE}}
 ** client assumes all replicas are fair game for searching all documents
 ** *CLIENT FAILS TO FIND EXPECTED DOCUMENTS IF QUERYING NON-LEADER REPLICA*
 * asynchronously, non-leader replicas get around to {{doRecovery}}
 ** only now are non-leader replicas marking themselves as {{state=RECOVERING}}


I think we need to reconsider when replicas are marked {{state=RECOVERING}}, 
either doing it synchronously in {{CoreAdminOperation.REQUESTRECOVERY_OP}}, or 
letting the leader set it when the leader knows it needs to initiate recovery, 
so that the status is updated and available to clients (and tests) immediately.

Alternatively: we need a more comprehensive way for clients (and tests) to know 
whether a shard is "healthy" than just checking the state of each replica 
(since {{state=RECOVERING}} isn't updated in real time); see the sketch below.
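
For example, a hedged sketch of what such a client/test currently observes 
during the race window (collection name and port are illustrative):
{code}
# Collections API cluster status: the out-of-date replica still reports
# "state":"active" here, even though it has not yet begun doRecovery, so
# "all replicas ACTIVE" is not a reliable proxy for "shard is healthy".
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1"
{code}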






[jira] [Created] (SOLR-14184) replace DirectUpdateHandler2.commitOnClose with something in TestInjection

2020-01-11 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created SOLR-14184:
-

 Summary: replace DirectUpdateHandler2.commitOnClose with something 
in TestInjection
 Key: SOLR-14184
 URL: https://issues.apache.org/jira/browse/SOLR-14184
 Project: Solr
  Issue Type: Test
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Chris M. Hostetter
Assignee: Chris M. Hostetter


{code:java}
public static volatile boolean commitOnClose = true;  // TODO: make this a real 
config option or move it to TestInjection
{code}

Lots of tests muck with this (to simulate unclean shutdown and force tlog 
replay on restart) but there's no guarantee that it is reset properly.

It should be replaced by logic in {{TestInjection}} that is correctly cleaned 
up by {{TestInjection.reset()}}; a rough sketch follows.
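
A hedged sketch of the shape this could take ({{TestInjection}} is the existing 
class, but the field and reset wiring here are hypothetical, not committed code):
{code:java}
// In org.apache.solr.util.TestInjection (sketch only):
public static volatile boolean skipIndexWriterCommitOnClose = false; // replaces DUH2.commitOnClose

public static void reset() {
  // ... existing resets ...
  skipIndexWriterCommitOnClose = false; // guaranteed cleanup between tests
}
{code}
Tests would then opt in per-test (e.g. {{TestInjection.skipIndexWriterCommitOnClose = true;}}) 
and rely on the existing {{reset()}} call in test teardown to restore the default.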






[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)

2020-01-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013603#comment-17013603
 ] 

ASF subversion and git services commented on SOLR-13486:


Commit 9a2497f6377601d396b1b3b8b83ffcab0fd331a3 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9a2497f ]

SOLR-13486: Fix trivial test bug in TestTlogReplayVsRecovery

Add TODOs for future test improvements once underlying race condition is fixed 
in core code


> race condition between leader's "replay on startup" and non-leader's "recover 
> from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)
> 
>
> Key: SOLR-13486
> URL: https://issues.apache.org/jira/browse/SOLR-13486
> Project: Solr
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-13486__test.patch, 
> apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, 
> apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, 
> org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> There is a bug in SolrCloud that can result in replicas being out of sync 
> with the leader if:
>  * The leader has uncommitted docs (in the tlog) that didn't make it to the 
> replica
>  * The leader restarts
>  * The replica begins to peer sync from the leader before the leader finishes 
> its own tlog replay on startup
> A "rolling restart" situation is when this is most likely to affect 
> real-world users
> This was first discovered via hard to reproduce TestCloudConsistency failures 
> in jenkins, but that test has since been modified to work around this bug, 
> and a new test "TestTlogReplayVsRecovery" has been added that more 
> aggressively demonstrates this error.
> Original jira description below...
> 
> I've been investigating some Jenkins failures from TestCloudConsistency, 
> which at first glance suggest a problem w/replica(s) recovering after a 
> network partition from the leader - but in digging into the logs the root 
> cause actually seems to be a thread race condition when a replica (the 
> leader) is first registered...
>  * The {{ZkContainer.registerInZk(...)}} method (which is called by 
> {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically 
> run in a background thread (via the {{ZkContainer.coreZkRegister}} 
> ExecutorService)
>  * {{ZkContainer.registerInZk(...)}} delegates to 
> {{ZKController.register(...)}} which is ultimately responsible for checking 
> if there are any "old" tlogs on disk, and if so handling the "Replaying tlog 
> for  during startup" logic
>  * Because this happens in a background thread, other logic/requests can be 
> handled by this core/replica in the meantime - before it starts (or while in 
> the middle of) replaying the tlogs
>  ** Notably: *leader's that have not yet replayed tlogs on startup will 
> erroneously respond to RTG / Fingerprint / PeerSync requests from other 
> replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently) 
> become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog 
> ... during startup" logic ... particularly since this can happen even for 
> replicas that are/become leaders. It seems like this could potentially cause 
> a whole host of problems, only one of which manifests in this particular test 
> failure:
>  * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog 
> ... during startup" check:
>  ** replicaX can recognize (via zk terms) that it should be the leader(X)
>  ** this leaderX can then instruct some other replicaY to recover from it
>  ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX 
> (either of its own volition, or because it was instructed to by leaderX) in 
> an attempt to recover
>  *** the responses to these recovery requests will not include updates in the 
> tlog files that existed on leaderX prior to startup that have not yet been 
> replayed
>  * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... 
> during startup" can finish
>  ** replicaY now thinks it is in sync with leaderX, but leaderX has 
> (replayed) updates the other replicas know nothing about






[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)

2020-01-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013613#comment-17013613
 ] 

ASF subversion and git services commented on SOLR-13486:


Commit 23fab1b6ebc08dab54f2937d2886fdc9c270711c in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=23fab1b ]

SOLR-13486: Fix trivial test bug in TestTlogReplayVsRecovery

Add TODOs for future test improvements once underlying race condition is fixed 
in core code

(cherry picked from commit 9a2497f6377601d396b1b3b8b83ffcab0fd331a3)








[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)

2020-01-11 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013614#comment-17013614
 ] 

Chris M. Hostetter commented on SOLR-13486:
---

New linked jiras:
* SOLR-14183: replicas do not immediately/synchronously reflect 
state=RECOVERING when receiving REQUESTRECOVERY commands
* SOLR-14184: replace DirectUpdateHandler2.commitOnClose with something in 
TestInjection







[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013667#comment-17013667
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

[~sokolov] thanks, I have also tested it myself with a real dataset generated 
from recent snapshot files of the Japanese Wikipedia. Yes, it seems 
"functionally correct", although we should do more formal tests to measure 
recall (effectiveness); a sketch of that metric follows below.
{quote}I think it's time to post back to a branch in the Apache git repository 
so we can enlist contributions from the community here to help this go forward. 
I'll try to get that done this weekend
{quote}
OK, I pushed the branch to the Apache Gitbox so that others who want to get 
involved in this issue can check it out and have a try. 
 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
 This also includes a patch from Xin-Chun Zhang. 
 Note: the new codec for the vectors and kNN graphs currently lives in 
{{o.a.l.codecs.lucene90}}; I think we can move it to the proper location when 
this is ready to be released.
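
For the recall measurement, a minimal sketch of the usual metric -- comparing 
the approximate top-k against an exact brute-force baseline (method and 
variable names are illustrative):
{code:java}
import java.util.HashSet;
import java.util.Set;

class RecallMetric {
  // recall@k = |approxTopK ∩ exactTopK| / k, typically averaged over many queries
  static double recallAtK(int[] approxTopK, int[] exactTopK) {
    Set<Integer> exact = new HashSet<>();
    for (int doc : exactTopK) exact.add(doc);
    int hits = 0;
    for (int doc : approxTopK) if (exact.contains(doc)) hits++;
    return (double) hits / exactTopK.length;
  }
}
{code}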


[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013667#comment-17013667
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 1/12/20 7:51 AM:


[~sokolov] thanks, I have also tested it myself with a real dataset generated 
from recent snapshot files of the Japanese Wikipedia. Yes, it seems 
"functionally correct", although we should do more formal tests to measure 
recall (effectiveness).
{quote}I think it's time to post back to a branch in the Apache git repository 
so we can enlist contributions from the community here to help this go forward. 
I'll try to get that done this weekend
{quote}
OK, I pushed the branch to the Apache Gitbox so that others who want to get 
involved in this issue can check it out and have a try. While I feel it's far 
from complete :), I agree that the code is ready to take in contributions from 
the community.
 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
 This also includes a patch from Xin-Chun Zhang. 
 Note: the new codec for the vectors and kNN graphs currently lives in 
{{o.a.l.codecs.lucene90}}; I think we can move it to the proper location when 
this is ready to be released.




[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9004:
--
Description: 
"Semantic" search based on machine-learned vector "embeddings" representing 
terms, queries and documents is becoming a must-have feature for a modern 
search engine. SOLR-12890 is exploring various approaches to this, including 
providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers 
have found an approach based on navigating a graph that partially encodes the 
nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
compared to exact nearest neighbor calculations) at a reasonable cost. This 
issue will explore implementing HNSW (hierarchical navigable small-world) 
graphs for the purpose of approximate nearest vector search (often referred to 
as KNN or k-nearest-neighbor search).

At a high level the way this algorithm works is this. First assume you have a 
graph that has a partial encoding of the nearest neighbor relation, with some 
short and some long-distance links. If this graph is built in the right way 
(has the hierarchical navigable small world property), then you can efficiently 
traverse it to find nearest neighbors (approximately) in log N time where N is 
the number of nodes in the graph. I believe this idea was pioneered in [1]. 
The great insight in that paper is that if you use the graph search algorithm 
to find the K nearest neighbors of a new document while indexing, and then link 
those neighbors (undirectedly, ie both ways) to the new document, then the 
graph that emerges will have the desired properties.

The implementation I propose for Lucene is as follows. We need two new data 
structures to encode the vectors and the graph. We can encode vectors using a 
light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
dimension and have efficient conversion from bytes to floats). For the graph we 
can use {{SortedNumericDocValues}} where the values we encode are the docids of 
the related documents. Encoding the interdocument relations using docids 
directly will make it relatively fast to traverse the graph since we won't need 
to look up through an id-field indirection. This choice limits us to building a 
graph-per-segment since it would be impractical to maintain a global graph for 
the whole index in the face of segment merges. However, graph-per-segment is 
very natural at search time - we can traverse each segment's graph 
independently and merge results as we do today for term-based search.

At index time, however, merging graphs is somewhat challenging. While indexing 
we build a graph incrementally, performing searches to construct links among 
neighbors. When merging segments we must construct a new graph containing 
elements of all the merged segments. Ideally we would somehow preserve the work 
done when building the initial graphs, but at least as a start I'd propose we 
construct a new graph from scratch when merging. The process is going to be  
limited, at least initially, to graphs that can fit in RAM since we require 
random access to the entire graph while constructing it: In order to add links 
bidirectionally we must continually update existing documents.

I think we want to express this API to users as a single joint 
{{KnnGraphField}} abstraction that joins together the vectors and the graph in 
one field type. Mostly it just looks like a vector-valued field, but 
has this graph attached to it.

I'll push a branch with my POC and would love to hear comments. It has many 
nocommits, basic design is not really set, there is no Query implementation and 
no integration with IndexSearcher, but it does work by some measure using a 
standalone test class. I've tested with uniform random vectors and on my laptop 
indexed 10K documents in around 10 seconds and searched them at 95% recall 
(compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't 
made any attempt to use multithreaded search for this, but it is amenable to 
per-segment concurrency.

[1] 
[https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]

 

*UPDATES:*
 * (1/12/2020) The up-to-date branch is: 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
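
As a concrete illustration of the "light wrapper around {{BinaryDocValues}}" 
idea above, the byte/float conversion could be as simple as the following 
sketch (a fixed per-field dimension is assumed; this is not the committed codec):
{code:java}
import java.nio.ByteBuffer;

class VectorCodecSketch {
  // Pack a float vector into the byte[] payload of a BinaryDocValues field.
  static byte[] encode(float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    buf.asFloatBuffer().put(vector);
    return buf.array();
  }

  // Recover the vector; the dimension is implied by the payload length.
  static float[] decode(byte[] bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
    return vector;
  }
}
{code}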


[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9004:
--
Description: 
"Semantic" search based on machine-learned vector "embeddings" representing 
terms, queries and documents is becoming a must-have feature for a modern 
search engine. SOLR-12890 is exploring various approaches to this, including 
providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers 
have found an approach based on navigating a graph that partially encodes the 
nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
compared to exact nearest neighbor calculations) at a reasonable cost. This 
issue will explore implementing HNSW (hierarchical navigable small-world) 
graphs for the purpose of approximate nearest vector search (often referred to 
as KNN or k-nearest-neighbor search).

At a high level the way this algorithm works is this. First assume you have a 
graph that has a partial encoding of the nearest neighbor relation, with some 
short and some long-distance links. If this graph is built in the right way 
(has the hierarchical navigable small world property), then you can efficiently 
traverse it to find nearest neighbors (approximately) in log N time where N is 
the number of nodes in the graph. I believe this idea was pioneered in  [1]. 
The great insight in that paper is that if you use the graph search algorithm 
to find the K nearest neighbors of a new document while indexing, and then link 
those neighbors (undirectedly, ie both ways) to the new document, then the 
graph that emerges will have the desired properties.

The implementation I propose for Lucene is as follows. We need two new data 
structures to encode the vectors and the graph. We can encode vectors using a 
light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
dimension and have efficient conversion from bytes to floats). For the graph we 
can use {{SortedNumericDocValues}} where the values we encode are the docids of 
the related documents. Encoding the interdocument relations using docids 
directly will make it relatively fast to traverse the graph since we won't need 
to lookup through an id-field indirection. This choice limits us to building a 
graph-per-segment since it would be impractical to maintain a global graph for 
the whole index in the face of segment merges. However graph-per-segment is a 
very natural at search time - we can traverse each segments' graph 
independently and merge results as we do today for term-based search.

At index time, however, merging graphs is somewhat challenging. While indexing 
we build a graph incrementally, performing searches to construct links among 
neighbors. When merging segments we must construct a new graph containing 
elements of all the merged segments. Ideally we would somehow preserve the work 
done when building the initial graphs, but at least as a start I'd propose we 
construct a new graph from scratch when merging. The process is going to be  
limited, at least initially, to graphs that can fit in RAM since we require 
random access to the entire graph while constructing it: In order to add links 
bidirectionally we must continually update existing documents.

I think we want to express this API to users as a single joint 
{{KnnGraphField}} abstraction that joins together the vectors and the graph as 
a single joint field type. Mostly it just looks like a vector-valued field, but 
has this graph attached to it.

I'll push a branch with my POC and would love to hear comments. It has many 
nocommits, basic design is not really set, there is no Query implementation and 
no integration iwth IndexSearcher, but it does work by some measure using a 
standalone test class. I've tested with uniform random vectors and on my laptop 
indexed 10K documents in around 10 seconds and searched them at 95% recall 
(compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't 
made any attempt to use multithreaded search for this, but it is amenable to 
per-segment concurrency.

[1] 
[https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]

 

*UPDATES:*
 * (1/12/2020) The up-to-date branch is: 
[https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]
