[jira] [Assigned] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N reassigned SOLR-13893:
---

Assignee: Munendra S N

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE =
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?
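
For reference, a minimal sketch of the corrected constant, assuming the intended 
property name is "runtime.lib.size" (an assumption here; the attached patch 
defines the actual name):

{code:java}
// Hypothetical fix: the property name "runtime.lib.size" is an assumption.
private static final long MAX_JAR_SIZE = Long.parseLong(
    System.getProperty("runtime.lib.size", String.valueOf(5 * 1024 * 1024)));
{code}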






[jira] [Assigned] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N reassigned SOLR-13944:
---

Assignee: Munendra S N

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was 
> not working in this case, though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> &fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullPointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])
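
For comparison, a sketch of the combined local-params form (assuming the intent 
was only to tag the collapse filter), analogous to the query quoted above:

{code:java}
&fq={!collapse tag=collapser field=productId sort='merchantOrder asc, price asc, id asc'}
{code}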






[jira] [Updated] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-06 Thread Yannick Welsch (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yannick Welsch updated LUCENE-9264:
---
Fix Version/s: master (9.0)

> Remove SimpleFSDirectory in favor of NIOFsDirectory
> ---
>
> Key: LUCENE-9264
> URL: https://issues.apache.org/jira/browse/LUCENE-9264
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: master (9.0)
>
>
> {{SimpleFSDirectory}} looks to duplicate what's already offered by 
> {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} is 
> using non-positional reads on the {{FileChannel}} (i.e., reads that are 
> stateful, changing the current position), and {{SimpleFSDirectory}} therefore 
> has to externally synchronize access to the read method.
> On Windows, positional reads are not supported, which is why {{FileChannel}} 
> is already internally using synchronization to guarantee only access by one 
> thread at a time for positional reads (see {{read(ByteBuffer dst, long 
> position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, 
> which returns true on Windows) and the JDK implementation for Windows is 
> emulating positional reads by using non-positional ones, see 
> [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139].
> This means that on Windows, there should be no difference between 
> {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it 
> should be equally poor as both implementations only allow one thread at a 
> time to read). On Linux/Mac, {{NIOFsDirectory}} is superior to 
> {{SimpleFSDirectory}}, however, as positional reads (pread) can be done 
> concurrently.
> My proposal is to remove {{SimpleFSDirectory}} and replace its uses with 
> {{NIOFsDirectory}}, given how similar these two directory implementations are 
> ({{SimpleFSDirectory}} isn't really simpler).
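
As a rough illustration of the difference described above (plain JDK calls, not 
code from either Directory implementation):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class ReadStyles {
  // Non-positional read (the SimpleFSDirectory style): the channel position is
  // shared state, so callers must synchronize externally to read at an offset.
  static int statefulRead(FileChannel ch, ByteBuffer dst, long offset) throws IOException {
    synchronized (ch) {
      ch.position(offset); // mutates the shared channel position
      return ch.read(dst); // advances the position as it reads
    }
  }

  // Positional read (the NIOFSDirectory style): maps to pread on Linux/macOS, so
  // concurrent readers do not contend; on Windows the JDK emulates it under an
  // internal lock, as noted above.
  static int positionalRead(FileChannel ch, ByteBuffer dst, long offset) throws IOException {
    return ch.read(dst, offset); // leaves the channel position untouched
  }
}
{code}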






[jira] [Created] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-9265:
--

 Summary: Deprecate SimpleFSDirectory
 Key: LUCENE-9265
 URL: https://issues.apache.org/jira/browse/LUCENE-9265
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Yannick Welsch









[GitHub] [lucene-solr] s1monw commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit

2020-03-06 Thread GitBox
s1monw commented on a change in pull request #1155: LUCENE-8962: Add ability to 
selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r388786805
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriterEvents.java
 ##
 @@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+
+package org.apache.lucene.index;
+
+/**
+ * Callback interface to signal various actions taken by IndexWriter.
+ *
+ * @lucene.experimental
+ */
+public interface IndexWriterEvents {
+  /**
+   * A default implementation that ignores all events.
+   */
+  IndexWriterEvents NULL_EVENTS = new IndexWriterEvents() {
+@Override
+public void beginMergeOnCommit() { }
+
+@Override
+public void finishMergeOnCommit() { }
+
+@Override
+public void abandonedMergesOnCommit(int abandonedCount) { }
+  };
+
+  /**
+   * Signals the start of waiting for a merge on commit, returned from
+   * {@link MergePolicy#findFullFlushMerges(MergeTrigger, SegmentInfos, 
MergePolicy.MergeContext)}.
+   */
+  void beginMergeOnCommit();
 
 Review comment:
   I am not really happy with this interface. First and foremost, it's only 
partially used in this PR. I also think it doesn't belong here but rather in 
a merge policy. I think IW and the merge lifecycle should not be tightly coupled. 
Can we achieve the same with an interface a MP can provide to the IW, rather 
than setting it on the IW config? A pull model should be used here instead, IMO.
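
A rough sketch of the pull model being suggested (hypothetical API, names 
invented for illustration; not part of this PR): the merge policy, rather than 
the IW config, would supply the callbacks, and IndexWriter would pull them from 
the policy it already holds.

{code:java}
// Hypothetical: IndexWriter would ask its MergePolicy for the callbacks.
public abstract class EventReportingMergePolicy extends MergePolicy {
  /** Returns the callbacks IndexWriter should notify, if any. */
  public IndexWriterEvents getIndexWriterEvents() {
    return IndexWriterEvents.NULL_EVENTS;
  }
}
{code}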





[jira] [Commented] (LUCENE-9033) Update Release docs and scripts with new site instructions

2020-03-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053200#comment-17053200
 ] 

Jan Høydahl commented on LUCENE-9033:
-

I have started work on releaseWizard.py but not yet ready.

> Update Release docs and scripts with new site instructions
> -
>
> Key: LUCENE-9033
> URL: https://issues.apache.org/jira/browse/LUCENE-9033
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/tools
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>
> * releaseWizard.py
>  * ReleaseTODO page
>  * addBackcompatIndexes.py
>  * archive-solr-ref-guide.sh
>  * createPatch.py
>  * publish-solr-ref-guide.sh
>  * solr-ref-gudie/src/meta-docs/publish.adoc
> There may be others






[jira] [Updated] (LUCENE-9033) Update Release docs and scripts with new site instructions

2020-03-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated LUCENE-9033:

Description: 
* releaseWizard.py (Started: janhoy)
 * ReleaseTODO page
 * addBackcompatIndexes.py
 * archive-solr-ref-guide.sh
 * createPatch.py
 * publish-solr-ref-guide.sh
 * -solr-ref-gudie/src/meta-docs/publish.adoc- (/) Done

There may be others

  was:
* releaseWizard.py
 * ReleaseTODO page
 * addBackcompatIndexes.py
 * archive-solr-ref-guide.sh
 * createPatch.py
 * publish-solr-ref-guide.sh
 * solr-ref-gudie/src/meta-docs/publish.adoc

There may be others


> Update Release docs and scripts with new site instructions
> -
>
> Key: LUCENE-9033
> URL: https://issues.apache.org/jira/browse/LUCENE-9033
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/tools
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>
> * releaseWizard.py (Started: janhoy)
>  * ReleaseTODO page
>  * addBackcompatIndexes.py
>  * archive-solr-ref-guide.sh
>  * createPatch.py
>  * publish-solr-ref-guide.sh
>  * -solr-ref-gudie/src/meta-docs/publish.adoc- (/) Done
> There may be others






[GitHub] [lucene-solr] ywelsch opened a new pull request #1321: LUCENE-9264: Remove SimpleFSDirectory in favor of NIOFSDirectory

2020-03-06 Thread GitBox
ywelsch opened a new pull request #1321: LUCENE-9264: Remove SimpleFSDirectory 
in favor of NIOFSDirectory
URL: https://github.com/apache/lucene-solr/pull/1321
 
 
   





[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053208#comment-17053208
 ] 

Bruno Roustant commented on SOLR-14040:
---

This is a good opportunity for me to learn how we deal with feature development 
and work in progress.

As I understand it, shared-schema (SolrCloud or not) is still in development and 
not documented, and it still has some limitations/problems. Do we document 
in-progress features in the ref guide?
Because it seems weird to me to document a limitation in the ref guide for a 
feature that is not yet documented as available. If we document the limitation, 
shouldn't we also document the feature itself? But it is not ready yet... 
difficult. Is there a section specific to "coming" features?

In fact, where do the users learn about this undocumented feature? Directly in 
the code? This is where we should explain the current limitations and risks.

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.






[jira] [Commented] (SOLR-9723) "Error writing document" on document add caused by NegativeArraySizeException

2020-03-06 Thread Kenny Knecht (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053221#comment-17053221
 ] 

Kenny Knecht commented on SOLR-9723:


still happens with 7.2.1

> "Error writing document" on document add caused by NegativeArraySizeException
> -
>
> Key: SOLR-9723
> URL: https://issues.apache.org/jira/browse/SOLR-9723
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 6.2.1
> Environment: Windows Server 2012 R2 x64, Java 1.8.0_111
>Reporter: Seva Alekseyev
>Priority: Major
>
> I'm adding documents to SOLR 6.2.1 via /solr/corename/update in a tight loop 
> on multiple threads. After some time, SOLR starts throwing intermittent 
> errors. They don't reproduce. Here's one:
> 2016-11-02 02:29:10.997 ERROR (qtp1389647288-10719) [   x:fscan] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception 
> writing document id 72513253_HS-RNA-Valenzuela-2.xls to the index; possible 
> analysis error.
>   at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178)
>   at 
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:939)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1094)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:720)
>   at 
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at 
> org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:91)
>   at 
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
>   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
>   at 
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)
>   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
>   a

[GitHub] [lucene-solr] janhoy opened a new pull request #1322: Remove some unused lines from addBackcompatIndexes.py related to svn

2020-03-06 Thread GitBox
janhoy opened a new pull request #1322: Remove some unused lines from 
addBackcompatIndexes.py related to svn
URL: https://github.com/apache/lucene-solr/pull/1322
 
 
   This is dead code in a Python script. We don't use svn anymore. I did not 
add corresponding git commands since the releaseWizard explicitly does an add 
after running the script, and no one has complained for such a long time :) 
   
   Tagging @sarowe since it seems you have touched this script in the past





[GitHub] [lucene-solr] janhoy opened a new pull request #1324: LUCENE-9033 Update ReleaseWizard for new website instructions

2020-03-06 Thread GitBox
janhoy opened a new pull request #1324: LUCENE-9033 Update ReleaseWizard for 
new website instructions
URL: https://github.com/apache/lucene-solr/pull/1324
 
 
   See https://issues.apache.org/jira/browse/LUCENE-9033
   
   This is still work in progress





[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Simon Willnauer (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053231#comment-17053231
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I read through this issue and I want to share some of my thoughts. First, I 
understand the need for this and the motivation, yet every time we add 
something like this to the IndexWriter to do something _as part of_ another 
method it triggers an alarm on my end. I have spent hours and days thinking 
about how IW can be simpler, and the biggest issue that I see is that the 
primitives on IW like commit or openReader are doing too much. Just look at 
openReader: it's pretty involved, and improving the bus factor or making it easier 
to understand is hard. Adding stuff like _wait for merge_ with something like a 
timeout is not something I think we should do, neither to _openReader_ nor to 
_commit_.  
That said, I think we can make the same things happen, but we should think in 
primitives rather than changing method behavior with configuration. Let me 
explain what I mean:

Let's say we keep _commit_ and _openReader_ the way they are and would instead 
allow using an existing reader, NRT or not, and allow it to _optimize_ 
itself (yeah, I said that - it might be a good name after all). With a slightly 
refactored IW we can share the merge logic and let the reader re-write itself; 
since we are talking about very small segments, the overhead is very small. This 
would in turn mean that we are doing the work twice, i.e. the IW would do its 
normal work and might merge later etc. We might even merge this stuff into 
heap space or so if we have enough; I haven't thought too much about that. This 
way we can potentially clean up IW and add a very nice optimization that works 
for commit as well as NRT. We should strive to make IW simpler, not do more. 
I hope I wasn't too discouraging. 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
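
A hedged sketch of the "custom merge policy" idea mentioned above (not the 
approach taken in the linked PR; the class name and threshold are invented): 
wrap an existing policy and additionally propose one merge that coalesces all 
tiny segments it can see.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

public class TinySegmentCoalescingMergePolicy extends FilterMergePolicy {
  private final long maxTinySegmentBytes;

  public TinySegmentCoalescingMergePolicy(MergePolicy in, long maxTinySegmentBytes) {
    super(in);
    this.maxTinySegmentBytes = maxTinySegmentBytes;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                       MergeContext context) throws IOException {
    MergeSpecification spec = super.findMerges(trigger, infos, context);
    // Collect segments below the threshold that are not already being merged.
    List<SegmentCommitInfo> tiny = new ArrayList<>();
    for (SegmentCommitInfo info : infos) {
      if (!context.getMergingSegments().contains(info)
          && info.sizeInBytes() <= maxTinySegmentBytes) {
        tiny.add(info);
      }
    }
    if (tiny.size() > 1) {
      if (spec == null) {
        spec = new MergeSpecification();
      }
      spec.add(new OneMerge(tiny)); // one extra merge over all the tiny segments
    }
    return spec;
  }
}
{code}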






[jira] [Updated] (LUCENE-9033) Update Release docs and scripts with new site instructions

2020-03-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated LUCENE-9033:

Description: 
*releaseWizard.py:* Janhoy has started on this, but will likely not finish 
before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* This script can be deleted, not in use since we do 
not publish PDF anymore

*(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done

 

There may be other places affected, such as other WIKI pages?

  was:
* releaseWizard.py (Started: janhoy)
 * ReleaseTODO page
 * addBackcompatIndexes.py
 * archive-solr-ref-guide.sh
 * createPatch.py
 * publish-solr-ref-guide.sh
 * -solr-ref-gudie/src/meta-docs/publish.adoc- (/) Done

There may be others


> Update Release docs and scripts with new site instructions
> -
>
> Key: LUCENE-9033
> URL: https://issues.apache.org/jira/browse/LUCENE-9033
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/tools
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>
> *releaseWizard.py:* Janhoy has started on this, but will likely not finish 
> before the 8.5 release
> *[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
> page:* I suggest we deprecate this page if folks are happy with 
> releaseWizard, which should encapsulate all steps and details, and can also 
> generate an HTML TODO document per release.
> *publish-solr-ref-guide.sh:* This script can be deleted, not in use since we 
> do not publish PDF anymore
> *(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done
>  
> There may be other places affected, such as other WIKI pages?






[GitHub] [lucene-solr] ywelsch opened a new pull request #1323: LUCENE-9265: Deprecate SimpleFSDirectory in favor of NIOFSDirectory

2020-03-06 Thread GitBox
ywelsch opened a new pull request #1323: LUCENE-9265: Deprecate 
SimpleFSDirectory in favor of NIOFSDirectory
URL: https://github.com/apache/lucene-solr/pull/1323
 
 
   





[jira] [Updated] (LUCENE-9033) Update Release docs and scripts with new site instructions

2020-03-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated LUCENE-9033:

Description: 
*releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
Janhoy has started on this, but will likely not finish before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* This script can be deleted, not in use since we do 
not publish PDF anymore

*(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done

 

There may be other places affected, such as other WIKI pages?

  was:
*releaseWizard.py:* Janhoy has started on this, but will likely not finish 
before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* This script can be deleted, not in use since we do 
not publish PDF anymore

*(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done

 

There may be other places affected, such as other WIKI pages?


> Update Release docs and scripts with new site instructions
> -
>
> Key: LUCENE-9033
> URL: https://issues.apache.org/jira/browse/LUCENE-9033
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/tools
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
> Janhoy has started on this, but will likely not finish before the 8.5 release
> *[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
> page:* I suggest we deprecate this page if folks are happy with 
> releaseWizard, which should encapsulate all steps and details, and can also 
> generate an HTML TODO document per release.
> *publish-solr-ref-guide.sh:* This script can be deleted, not in use since we 
> do not publish PDF anymore
> *(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done
>  
> There may be other places affected, such as other WIKI pages?






[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-06 Thread Yannick Welsch (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053233#comment-17053233
 ] 

Yannick Welsch commented on LUCENE-9264:


I've opened a pull request for the removal (linked in this issue) and one for 
the deprecation (see sub-task).

> Remove SimpleFSDirectory in favor of NIOFsDirectory
> ---
>
> Key: LUCENE-9264
> URL: https://issues.apache.org/jira/browse/LUCENE-9264
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{SimpleFSDirectory}} looks to duplicate what's already offered by 
> {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} is 
> using non-positional reads on the {{FileChannel}} (i.e., reads that are 
> stateful, changing the current position), and {{SimpleFSDirectory}} therefore 
> has to externally synchronize access to the read method.
> On Windows, positional reads are not supported, which is why {{FileChannel}} 
> is already internally using synchronization to guarantee only access by one 
> thread at a time for positional reads (see {{read(ByteBuffer dst, long 
> position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, 
> which returns true on Windows) and the JDK implementation for Windows is 
> emulating positional reads by using non-positional ones, see 
> [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139].
> This means that on Windows, there should be no difference between 
> {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it 
> should be equally poor as both implementations only allow one thread at a 
> time to read). On Linux/Mac, {{NIOFsDirectory}} is superior to 
> {{SimpleFSDirectory}}, however, as positional reads (pread) can be done 
> concurrently.
> My proposal is to remove {{SimpleFSDirectory}} and replace its uses with 
> {{NIOFsDirectory}}, given how similar these two directory implementations are 
> ({{SimpleFSDirectory}} isn't really simpler).






[GitHub] [lucene-solr] rmuir commented on issue #1321: LUCENE-9264: Remove SimpleFSDirectory in favor of NIOFSDirectory

2020-03-06 Thread GitBox
rmuir commented on issue #1321: LUCENE-9264: Remove SimpleFSDirectory in favor 
of NIOFSDirectory
URL: https://github.com/apache/lucene-solr/pull/1321#issuecomment-595695483
 
 
   Looks great! Thanks for doing this cleanup. Will merge it shortly...





[GitHub] [lucene-solr] s1monw opened a new pull request #1325: Consolidated process event logic after CRUD action

2020-03-06 Thread GitBox
s1monw opened a new pull request #1325: Consolidated process event logic after 
CRUD action
URL: https://github.com/apache/lucene-solr/pull/1325
 
 
   Today we have duplicated logic on how to convert a seqNo into a real
   seqNo and process events based on this. This change consolidated the logic
   into a single method.





[GitHub] [lucene-solr] janhoy opened a new pull request #1326: Remove unused scripts in dev-tools folder

2020-03-06 Thread GitBox
janhoy opened a new pull request #1326: Remove unused scripts in dev-tools 
folder
URL: https://github.com/apache/lucene-solr/pull/1326
 
 
   Cleanup of unused scripts. Please validate my assumption that this is not in 
use :)





[GitHub] [lucene-solr] asfgit closed pull request #1320: LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.

2020-03-06 Thread GitBox
asfgit closed pull request #1320: LUCENE-9257: Always keep FST off-heap. Remove 
FSTLoadMode and Reader attributes.
URL: https://github.com/apache/lucene-solr/pull/1320
 
 
   





[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053247#comment-17053247
 ] 

ASF subversion and git services commented on LUCENE-9257:
-

Commit 97336434661cf32f4674ddb43901219f678e2608 in lucene-solr's branch 
refs/heads/master from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9733643 ]

LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.

Closes #1320


> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates are using these enum and attribute key to 
> drive the FST load mode through the SegmentReader which is not specific to a 
> postings format. They have an unnecessary dependency to BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader, to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates do not 
> import anymore BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)






[GitHub] [lucene-solr] asfgit closed pull request #1321: LUCENE-9264: Remove SimpleFSDirectory in favor of NIOFSDirectory

2020-03-06 Thread GitBox
asfgit closed pull request #1321: LUCENE-9264: Remove SimpleFSDirectory in 
favor of NIOFSDirectory
URL: https://github.com/apache/lucene-solr/pull/1321
 
 
   





[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053282#comment-17053282
 ] 

ASF subversion and git services commented on LUCENE-9264:
-

Commit 624f5a3c2f5ab25a44b3e3843dbef36d4ed70602 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=624f5a3 ]

LUCENE-9264: Remove SimpleFSDirectory in favor of NIOFSDirectory

Closes #1321


> Remove SimpleFSDirectory in favor of NIOFsDirectory
> ---
>
> Key: LUCENE-9264
> URL: https://issues.apache.org/jira/browse/LUCENE-9264
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{SimpleFSDirectory}} looks to duplicate what's already offered by 
> {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} is 
> using non-positional reads on the {{FileChannel}} (i.e., reads that are 
> stateful, changing the current position), and {{SimpleFSDirectory}} therefore 
> has to externally synchronize access to the read method.
> On Windows, positional reads are not supported, which is why {{FileChannel}} 
> is already internally using synchronization to guarantee only access by one 
> thread at a time for positional reads (see {{read(ByteBuffer dst, long 
> position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, 
> which returns true on Windows) and the JDK implementation for Windows is 
> emulating positional reads by using non-positional ones, see 
> [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139].
> This means that on Windows, there should be no difference between 
> {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it 
> should be equally poor as both implementations only allow one thread at a 
> time to read). On Linux/Mac, {{NIOFsDirectory}} is superior to 
> {{SimpleFSDirectory}}, however, as positional reads (pread) can be done 
> concurrently.
> My proposal is to remove {{SimpleFSDirectory}} and replace its uses with 
> {{NIOFsDirectory}}, given how similar these two directory implementations are 
> ({{SimpleFSDirectory}} isn't really simpler).






[GitHub] [lucene-solr] rmuir merged pull request #1323: LUCENE-9265: Deprecate SimpleFSDirectory in favor of NIOFSDirectory

2020-03-06 Thread GitBox
rmuir merged pull request #1323: LUCENE-9265: Deprecate SimpleFSDirectory in 
favor of NIOFSDirectory
URL: https://github.com/apache/lucene-solr/pull/1323
 
 
   





[jira] [Updated] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9265:

Fix Version/s: 8.5

> Deprecate SimpleFSDirectory
> ---
>
> Key: LUCENE-9265
> URL: https://issues.apache.org/jira/browse/LUCENE-9265
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Resolved] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9265.
-
Resolution: Fixed

> Deprecate SimpleFSDirectory
> ---
>
> Key: LUCENE-9265
> URL: https://issues.apache.org/jira/browse/LUCENE-9265
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Resolved] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9264.
-
Resolution: Fixed

> Remove SimpleFSDirectory in favor of NIOFsDirectory
> ---
>
> Key: LUCENE-9264
> URL: https://issues.apache.org/jira/browse/LUCENE-9264
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{SimpleFSDirectory}} looks to duplicate what's already offered by 
> {{NIOFsDirectory}}. The only difference is that {{SimpleFSDirectory}} is 
> using non-positional reads on the {{FileChannel}} (i.e., reads that are 
> stateful, changing the current position), and {{SimpleFSDirectory}} therefore 
> has to externally synchronize access to the read method.
> On Windows, positional reads are not supported, which is why {{FileChannel}} 
> is already internally using synchronization to guarantee only access by one 
> thread at a time for positional reads (see {{read(ByteBuffer dst, long 
> position)}} in {{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, 
> which returns true on Windows) and the JDK implementation for Windows is 
> emulating positional reads by using non-positional ones, see 
> [http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139].
> This means that on Windows, there should be no difference between 
> {{NIOFsDirectory}} and {{SimpleFSDirectory}} in terms of performance (it 
> should be equally poor as both implementations only allow one thread at a 
> time to read). On Linux/Mac, {{NIOFsDirectory}} is superior to 
> {{SimpleFSDirectory}}, however, as positional reads (pread) can be done 
> concurrently.
> My proposal is to remove {{SimpleFSDirectory}} and replace its uses with 
> {{NIOFsDirectory}}, given how similar these two directory implementations are 
> ({{SimpleFSDirectory}} isn't really simpler).






[jira] [Commented] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053286#comment-17053286
 ] 

ASF subversion and git services commented on LUCENE-9265:
-

Commit c3d9cd1bf35e858cdb2efa550e8ad17d0e5106ef in lucene-solr's branch 
refs/heads/branch_8x from Yannick Welsch
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c3d9cd1 ]

LUCENE-9265: Deprecate SimpleFSDirectory in favor of NIOFSDirectory (#1323)



> Deprecate SimpleFSDirectory
> ---
>
> Key: LUCENE-9265
> URL: https://issues.apache.org/jira/browse/LUCENE-9265
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Yannick Welsch
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Updated] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9265:

Fix Version/s: (was: 8.5)
   8.6

> Deprecate SimpleFSDirectory
> ---
>
> Key: LUCENE-9265
> URL: https://issues.apache.org/jira/browse/LUCENE-9265
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: 8.6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053291#comment-17053291
 ] 

ASF subversion and git services commented on LUCENE-9265:
-

Commit 775900c77680058baae5969241c4b3c5bfd82d2b in lucene-solr's branch 
refs/heads/branch_8x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=775900c ]

LUCENE-9265: move entry to 8.6 section


> Deprecate SimpleFSDirectory
> ---
>
> Key: LUCENE-9265
> URL: https://issues.apache.org/jira/browse/LUCENE-9265
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: 8.6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Updated] (LUCENE-9033) Update Release docs and scripts with new site instructions

2020-03-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated LUCENE-9033:

Description: 
*releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
Janhoy has started on this, but will likely not finish before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* 
[PR#1326|https://github.com/apache/lucene-solr/pull/1326] This script can be 
deleted, not in use since we do not publish PDF anymore

*(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done

 

There may be other places affected, such as other WIKI pages?

  was:
*releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
Janhoy has started on this, but will likely not finish before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* This script can be deleted, not in use since we do 
not publish PDF anymore

*(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done

 

There may be other places affected, such as other WIKI pages?


> Update Release docs and scripts with new site instructions
> -
>
> Key: LUCENE-9033
> URL: https://issues.apache.org/jira/browse/LUCENE-9033
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/tools
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
> Janhoy has started on this, but will likely not finish before the 8.5 release
> *[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
> page:* I suggest we deprecate this page if folks are happy with 
> releaseWizard, which should encapsulate all steps and details, and can also 
> generate an HTML TODO document per release.
> *publish-solr-ref-guide.sh:* 
> [PR#1326|https://github.com/apache/lucene-solr/pull/1326] This script can be 
> deleted, not in use since we do not publish PDF anymore
> *(/) solr-ref-gudie/src/meta-docs/publish.adoc:*  Done
>  
> There may be other places affected, such as other WIKI pages?






[GitHub] [lucene-solr] janhoy closed pull request #880: Tweak header format.

2020-03-06 Thread GitBox
janhoy closed pull request #880: Tweak header format.
URL: https://github.com/apache/lucene-solr/pull/880
 
 
   





[GitHub] [lucene-solr] janhoy commented on issue #404: Comment to explain how to use URLClassifyProcessorFactory

2020-03-06 Thread GitBox
janhoy commented on issue #404: Comment to explain how to use 
URLClassifyProcessorFactory
URL: https://github.com/apache/lucene-solr/pull/404#issuecomment-595723278
 
 
   @ohtwadi Do you want to address the review comment so we can merge this?





[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053338#comment-17053338
 ] 

ASF subversion and git services commented on SOLR-13942:


Commit 4cf37ade3531305d508e383b9c16a0c5690bacae in lucene-solr's branch 
refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4cf37ad ]

Revert "SOLR-13942: /api/cluster/zk/* to fetch raw ZK data"

This reverts commit bc6fa3b65060b17a88013a0378f4a9d285067d82.


> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: New Feature
>  Components: v2 API
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053339#comment-17053339
 ] 

ASF subversion and git services commented on SOLR-13942:


Commit a8e7895c3007f3aa7e58bc52fb610416e80850a6 in lucene-solr's branch 
refs/heads/branch_8x from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a8e7895 ]

Revert "SOLR-13942: /api/cluster/zk/* to fetch raw ZK data"

This reverts commit 2044f8c83ebb0775d76b1e96c168ca936701abd4.


> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: New Feature
>  Components: v2 API
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[GitHub] [lucene-solr] noblepaul opened a new pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread GitBox
noblepaul opened a new pull request #1327: SOLR-13942: /api/cluster/zk/* to 
fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053344#comment-17053344
 ] 

Noble Paul commented on SOLR-13942:
---

I've opened a new PR and added more tests. Please review.

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: New Feature
>  Components: v2 API
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14309) Expose GC logs via an HTTP API

2020-03-06 Thread Noble Paul (Jira)
Noble Paul created SOLR-14309:
-

 Summary: Expose GC logs via an HTTP API
 Key: SOLR-14309
 URL: https://issues.apache.org/jira/browse/SOLR-14309
 Project: Solr
  Issue Type: Sub-task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Noble Paul






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14310) Expose solr logs with basic filters via HTTP

2020-03-06 Thread Noble Paul (Jira)
Noble Paul created SOLR-14310:
-

 Summary: Expose solr logs with basic filters via HTTP
 Key: SOLR-14310
 URL: https://issues.apache.org/jira/browse/SOLR-14310
 Project: Solr
  Issue Type: Sub-task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Noble Paul






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9241) fix most memory-hungry tests

2020-03-06 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053361#comment-17053361
 ] 

Dawid Weiss commented on LUCENE-9241:
-

I wasn't really that much concerned; just pointing out the (sad) fact of how 
it's implemented for Windows.

> fix most memory-hungry tests
> 
>
> Key: LUCENE-9241
> URL: https://issues.apache.org/jira/browse/LUCENE-9241
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9241.patch
>
>
> Currently each test JVM has an Xmx of 512M. With a modern MacBook Pro running 
> many test JVMs in parallel, this adds up to 4GB, which is pretty crazy.
> On the other hand, if we fix a few edge cases, tests can work with lower 
> heaps such as 128M. This can save many gigabytes (also it finds interesting 
> memory waste/issues).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13944:

Attachment: SOLR-13944.patch

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix 
> was not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> &fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullPointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053381#comment-17053381
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate and write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053380#comment-17053380
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate and write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053382#comment-17053382
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate and write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13944:

Status: Patch Available  (was: Open)

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix 
> was not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> &fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullPointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053385#comment-17053385
 ] 

Munendra S N commented on SOLR-13944:
-

 [^SOLR-13944.patch] 
Initial patch for fixing NPE. 

This is valid: the default defType for fq is lucene, so the localParams 
syntax is parsed, but the case of tagging a collapse filter wasn't handled in 
SOLR-8807 (it was doing a simple string match). Here, I have replaced that with 
filter parsing; without it we can't know whether there is a collapse filter or not.
{noformat}
fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
asc, id asc'}
{noformat}
[~tflobbe] As you had asked the user to create the JIRA issue, I would prefer 
it if you could take a look at this patch.
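
For reference, a sketch of the combined form of the local params (illustrative only, not part of the attached patch; the field and sort values are copied from the report above):
{noformat}
fq={!collapse tag=collapser field=productId sort='merchantOrder asc, price asc, id asc'}
{noformat}
Here the tag and the collapse parser sit in a single set of local params instead of two back-to-back {!...} blocks.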

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix 
> was not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> &fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullPointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053387#comment-17053387
 ] 

Munendra S N commented on SOLR-11725:
-

I'm planning to commit this weekend (only to master), let me know if there are 
any concerns

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check whether these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
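
To make the difference concrete, here is a small standalone sketch (not Solr code; the sample values are arbitrary) that evaluates the two formulas quoted above on the same data:
{code:java}
// Standalone illustration comparing the two formulas quoted above.
public class StddevCompare {
  public static void main(String[] args) {
    double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
    double count = values.length, sum = 0, sumSq = 0;
    for (double v : values) { sum += v; sumSq += v * v; }
    // "Uncorrected sample stddev" (json.facet's StddevAgg style)
    double uncorrected = Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    // "Corrected sample stddev" (StatsValuesFactory style, divides by count - 1)
    double corrected = Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    System.out.println(uncorrected + " vs " + corrected); // prints 2.0 vs ~2.14
  }
}
{code}
Even with just eight values the two results differ by about 7%, which matches the point above that the choice matters most for small facet buckets.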



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant opened a new pull request #1328: LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter.

2020-03-06 Thread GitBox
bruno-roustant opened a new pull request #1328: LUCENE-9257: Always keep FST 
off-heap. Remove SegmentReadState.openedFromWriter.
URL: https://github.com/apache/lucene-solr/pull/1328
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053400#comment-17053400
 ] 

Bruno Roustant commented on LUCENE-9257:


While preparing the port to the 8x branch I saw that I had forgotten a significant 
cleanup: the openedFromWriter boolean, which was also added to support the 
FSTLoadMode logic, so I am removing it as well.
For visibility I opened PR #1328, but I'll commit it immediately.

> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associated attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates use this enum and attribute key to 
> drive the FST load mode through the SegmentReader, which is not specific to a 
> postings format. They have an unnecessary dependency on BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates no 
> longer import BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant closed pull request #1305: LUCENE-9257: Make FSTLoadMode enum not BlockTree specific.

2020-03-06 Thread GitBox
bruno-roustant closed pull request #1305: LUCENE-9257: Make FSTLoadMode enum 
not BlockTree specific.
URL: https://github.com/apache/lucene-solr/pull/1305
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on issue #1305: LUCENE-9257: Make FSTLoadMode enum not BlockTree specific.

2020-03-06 Thread GitBox
bruno-roustant commented on issue #1305: LUCENE-9257: Make FSTLoadMode enum not 
BlockTree specific.
URL: https://github.com/apache/lucene-solr/pull/1305#issuecomment-595761243
 
 
   Replaced by https://github.com/apache/lucene-solr/pull/1320 to always keep 
FST off-heap.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] sigram opened a new pull request #1329: SOLR-14275: Policy calculations are very slow for large clusters and large operations

2020-03-06 Thread GitBox
sigram opened a new pull request #1329: SOLR-14275: Policy calculations are 
very slow for large clusters and large operations
URL: https://github.com/apache/lucene-solr/pull/1329
 
 
   
   
   
   # Description
   
   See JIRA for the explanation of the problem.
   
   # Solution
   
   Try to reduce the combinatorial explosion in the candidate placements. Use 
caching more effectively.
   
   # Tests
   
   Manual performance tests using the scenario.txt attached to JIRA.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053412#comment-17053412
 ] 

ASF subversion and git services commented on LUCENE-9257:
-

Commit c73d2c15ba7c5936715408807184c99ab7cfdfd4 in lucene-solr's branch 
refs/heads/master from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c73d2c1 ]

LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter.


> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associated attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates use this enum and attribute key to 
> drive the FST load mode through the SegmentReader, which is not specific to a 
> postings format. They have an unnecessary dependency on BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates no 
> longer import BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn edited a comment on issue #1155: LUCENE-8962: Add ability to selectively merge on commit

2020-03-06 Thread GitBox
dnhatn edited a comment on issue #1155: LUCENE-8962: Add ability to selectively 
merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#issuecomment-595607002
 
 
   I missed the fact that `mergeFinished` is executed under IndexWriter lock. I 
will dig into this again. Please ignore my previous comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053427#comment-17053427
 ] 

David Smiley commented on LUCENE-8962:
--

Thanks so much for your input Simon!  We need to fight the complexity here.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate and write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053432#comment-17053432
 ] 

ASF subversion and git services commented on LUCENE-9257:
-

Commit e7a61eadf6d2f3c722c791e7470a79b2e919cdeb in lucene-solr's branch 
refs/heads/branch_8x from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e7a61ea ]

LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode, Reader attributes 
and openedFromWriter.


> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associated attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates use this enum and attribute key to 
> drive the FST load mode through the SegmentReader, which is not specific to a 
> postings format. They have an unnecessary dependency on BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates no 
> longer import BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9257.

Fix Version/s: 8.6
   Resolution: Fixed

Thanks reviewers!

> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
> Fix For: 8.6
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associated attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates use this enum and attribute key to 
> drive the FST load mode through the SegmentReader, which is not specific to a 
> postings format. They have an unnecessary dependency on BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates no 
> longer import BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant closed pull request #1328: LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter.

2020-03-06 Thread GitBox
bruno-roustant closed pull request #1328: LUCENE-9257: Always keep FST 
off-heap. Remove SegmentReadState.openedFromWriter.
URL: https://github.com/apache/lucene-solr/pull/1328
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13199:

Attachment: SOLR-13199.patch

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]&q=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13199:

Status: Patch Available  (was: Open)

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]&q=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N reassigned SOLR-13199:
---

Assignee: Munendra S N

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Assignee: Munendra S N
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]&q=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053452#comment-17053452
 ] 

Munendra S N commented on SOLR-13199:
-

 [^SOLR-13199.patch] 
The NPE still occurs when the nestedPath field is not used. I have removed the 
version check, which wasn't required.
When the parentFilter string is specified but resolves to {{null}} after parsing, 
the parentFilter is now set to {{MatchNoDocsQuery}}.
[~dsmiley] Could you please review this once?
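
Purely to illustrate where the guard matters (this is not the attached patch, which takes the {{MatchNoDocsQuery}} approach described above), the call site quoted in the description could check for the null return in the context of that snippet:
{code:java}
// Illustrative guard only: QueryBitSetProducer.getBitSet returns null when the
// parent filter matches no documents in this segment, so check before using it.
final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
if (segParentsBitSet == null) {
  return; // no parent documents in this segment, so there are no children to attach
}
final int segPrevRootId = segRootId == 0 ? -1 : segParentsBitSet.prevSetBit(segRootId - 1);
{code}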

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]&q=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8103) QueryValueSource should use TwoPhaseIterator

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053517#comment-17053517
 ] 

David Smiley commented on LUCENE-8103:
--

Notice that {{TwoPhaseIterator.asDocIdSetIterator(tpi);}} will return an 
implementation whose {{advance(docId)}} method will move beyond the passed-in 
docID and call {{matches()}} until it finds a match.  That is a waste _if the 
user of this DISI doesn't care what the next matching document is when the 
approximation doesn't match_.  So QueryValueSource's exists() method could work 
with the approximation first and, only if that matches, call {{TPI.matches()}}.  
If there is no TPI then the scorer's DISI is accurate.
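
A rough sketch of that approximation-first check (the surrounding method and the {{scorer}}/{{doc}} variables are assumptions, not the actual patch):
{code:java}
import java.io.IOException;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TwoPhaseIterator;

// Sketch only: report whether the wrapped query matches 'doc'. Unlike
// asDocIdSetIterator().advance(doc), this never calls matches() on documents
// beyond the requested one.
static boolean exists(Scorer scorer, int doc) throws IOException {
  TwoPhaseIterator tpi = scorer.twoPhaseIterator();
  if (tpi == null) {
    // No two-phase support: the scorer's DISI is already exact.
    return scorer.iterator().advance(doc) == doc;
  }
  // Advance only the cheap approximation; confirm with matches() only when
  // it lands exactly on the requested doc.
  return tpi.approximation().advance(doc) == doc && tpi.matches();
}
{code}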

> QueryValueSource should use TwoPhaseIterator
> 
>
> Key: LUCENE-8103
> URL: https://issues.apache.org/jira/browse/LUCENE-8103
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/other
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-8103.patch
>
>
> QueryValueSource (in "queries" module) is a ValueSource representation of a 
> Query; the score is the value.  It ought to try to use a TwoPhaseIterator 
> from the query if it can be offered. This will prevent possibly expensive 
> advancing beyond documents that we aren't interested in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Nhat Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053545#comment-17053545
 ] 

Nhat Nguyen commented on LUCENE-8962:
-

Some engine tests in Elasticsearch are failing because of this change. I am 
working to backport them to Lucene so that we can catch similar issues in 
Lucene.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
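
As a rough illustration of the "custom merge policy" idea in the description, here is a sketch that only coalesces segments below a size threshold. Signatures are assumed from recent Lucene versions; this is not the actual LUCENE-8962 change, which also has to deal with segments already being merged and the point-in-time refresh semantics discussed above:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

// Sketch only: propose one merge containing every segment smaller than a
// configured byte threshold; all other decisions fall through to the wrapped
// policy via FilterMergePolicy.
public class SmallSegmentsMergePolicy extends FilterMergePolicy {
  private final long maxSegmentBytes;

  public SmallSegmentsMergePolicy(MergePolicy in, long maxSegmentBytes) {
    super(in);
    this.maxSegmentBytes = maxSegmentBytes;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                       MergeContext context) throws IOException {
    List<SegmentCommitInfo> small = new ArrayList<>();
    for (SegmentCommitInfo sci : infos) {
      if (sci.sizeInBytes() <= maxSegmentBytes) {
        small.add(sci);
      }
    }
    if (small.size() < 2) {
      return null; // nothing worth coalescing
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(small));
    return spec;
  }
}
{code}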



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13893:

Status: Patch Available  (was: Open)

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053562#comment-17053562
 ] 

Munendra S N commented on SOLR-13893:
-

 [^SOLR-13893.patch] 
Slightly modified patch

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14289) Solr may attempt to check Chroot after already having connected once

2020-03-06 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053561#comment-17053561
 ] 

Mike Drob commented on SOLR-14289:
--

[~dsmiley] - seems like we're working on similar problems around speeding up 
core startup - can you take a look at this and let me know what you think?

> Solr may attempt to check Chroot after already having connected once
> 
>
> Key: SOLR-14289
> URL: https://issues.apache.org/jira/browse/SOLR-14289
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Server
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Attachments: Screen Shot 2020-02-26 at 2.56.14 PM.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On server startup, we will attempt to load the solr.xml from zookeeper if we 
> have the right properties set, and then later when starting up the core 
> container will take time to verify (and create) the chroot even if it is the 
> same string that we already used before. We can likely skip the second 
> short-lived zookeeper connection to speed up our startup sequence a little 
> bit.
>  
> See this attached image from thread profiling during startup.
> !Screen Shot 2020-02-26 at 2.56.14 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13893:

Attachment: SOLR-13893.patch

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N reassigned SOLR-11725:
---

Assignee: Munendra S N

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check whether these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the diff between the calculations doesn't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
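
For reference, a self-contained sketch of the two formulas quoted above (illustration only, not the Solr code):
{code:java}
// json.facet / StddevAgg style: uncorrected sample stddev.
static double uncorrectedStddev(double sum, double sumSq, long count) {
  return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
}

// stats.field / StatsValuesFactory style: corrected sample stddev.
static double correctedStddev(double sum, double sumSq, long count) {
  return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
}

// For the values {1, 2, 3}: sum=6, sumSq=14, count=3, so the uncorrected form
// gives ~0.816 while the corrected form gives 1.0; the gap shrinks as count
// grows, which is point #2 in the list above.
{code}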



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-11359) An autoscaling/suggestions endpoint to recommend operations

2020-03-06 Thread Megan Carey (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052544#comment-17052544
 ] 

Megan Carey edited comment on SOLR-11359 at 3/6/20, 4:38 PM:
-

Would it be possible to explicitly return the URL to hit for applying the 
suggestion? i.e. rather than return an HTTP method, operation type, etc. just 
return the constructed URL for executing the action?

Also, are you considering writing a cron to periodically execute these 
suggestions? Or was the intention for these to be manually applied? 
[~noble.paul]


was (Author: megancarey):
Would it be possible to explicitly return the URL to hit for applying the 
suggestion? i.e. rather than return an HTTP method, operation type, etc. just 
return the constructed URL for executing the action?

Also, are you considering writing a cron to periodically execute these 
suggestions?

> An autoscaling/suggestions endpoint to recommend operations
> ---
>
> Key: SOLR-11359
> URL: https://issues.apache.org/jira/browse/SOLR-11359
> Project: Solr
>  Issue Type: New Feature
>  Components: AutoScaling
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-11359.patch
>
>
> Autoscaling can make suggestions to users on what operations they can perform 
> to improve the health of the cluster
> The suggestions will have the following information
> * http end point
> * http method (POST,DELETE)
> * command payload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on issue #1301: LUCENE-9254: UniformSplit supports FST off-heap.

2020-03-06 Thread GitBox
bruno-roustant commented on issue #1301: LUCENE-9254: UniformSplit supports FST 
off-heap.
URL: https://github.com/apache/lucene-solr/pull/1301#issuecomment-595856255
 
 
   Updated after LUCENE-9257 removed FSTLoadMode. Now FST is off-heap by 
default. It is possible to force it with a boolean in the 
UniformSplitPostingsFormat.
   Also, FST is always on-heap if there is block encoding/decoding.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field

2020-03-06 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley reassigned LUCENE-9258:


Assignee: David Smiley

> DocTermsIndexDocValues should not assume it's operating on a SortedDocValues 
> field
> --
>
> Key: LUCENE-9258
> URL: https://issues.apache.org/jira/browse/LUCENE-9258
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.7.2, 8.4
>Reporter: Michele Palmia
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-9258.patch
>
>
> When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from 
> _DocTermsIndexDocValues_ , the latter instantiates a new iterator on 
> _SortedDocValues_ regardless of the fact that the underlying field can 
> actually be of a different type (e.g. a _SortedSetDocValues_ processed 
> through a _SortedSetSelector_).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053589#comment-17053589
 ] 

David Smiley commented on LUCENE-9258:
--

Makes sense to me; your test is perfect.  I'm curious; how did you see this at 
a higher level (e.g. Solr or ES)?  The issue title & details here are a bit 
geeky / low-level and I'm trying to think of a good CHANGES.txt entry that 
might be more meaningful to users.

> DocTermsIndexDocValues should not assume it's operating on a SortedDocValues 
> field
> --
>
> Key: LUCENE-9258
> URL: https://issues.apache.org/jira/browse/LUCENE-9258
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.7.2, 8.4
>Reporter: Michele Palmia
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-9258.patch
>
>
> When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from 
> _DocTermsIndexDocValues_ , the latter instantiates a new iterator on 
> _SortedDocValues_ regardless of the fact that the underlying field can 
> actually be of a different type (e.g. a _SortedSetDocValues_ processed 
> through a _SortedSetSelector_).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Comment: was deleted

(was: Hi, [~jtibshirani], thanks for your suggestions!

??"I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms."??

In the previous implementation 
([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]),
the cluster information is divided into two parts – meta (.ifi) and data (.ifd) 
as shown in the following figure, where each cluster with a postings list is 
stored in the data file (.ifd) and not kept on-heap. A major concern with this 
implementation is the read performance of the cluster data, since reads are 
very frequent during kNN search. I will test and check the performance. 

!image-2020-02-16-15-05-02-451.png!

??"Because of this concern, it could be nice to include benchmarks for index 
time (in addition to QPS)..."??

Many thanks! I will check the links you mentioned and consider optimizing the 
clustering cost. In addition, more benchmarks will be added soon.

 
h2. *UPDATE – Feb. 24, 2020*

I have added a new implementation of the IVF index, marked as *V2* under the 
package org.apache.lucene.codecs.lucene90. In the current implementation, the 
IVF index is divided into two files with suffixes .ifi and .ifd, respectively. 
The .ifd file is read when cluster information is needed. The experiments were 
conducted on the sift1M dataset (test code: 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/KnnIvfPerformTester.java]),
 detailed results are as follows,
 # add document -- 3921 ms;
 # commit -- 3912286 ms (mainly spent on k-means training, 10 iterations, 4000 
centroids, totally 512,000 vectors used for training);
 # R@100 recall time and recall ratio are listed in the following table

 
||nprobe||avg. search time (ms)||recall ratio (%)||
|8|28.0755|44.154|
|16|27.1745|57.9945|
|32|32.986|71.7003|
|64|40.4082|83.50471|
|128|50.9569|92.07929|
|256|73.923|97.150894|

 Compared with the on-heap implementation of the IVF index, the query time 
increases significantly (22%~71%). Actually, the IVF index is comprised of 
unique docIDs and will not take up too much memory. *There is a small argument 
about whether to keep the cluster information on-heap or not. I hope to hear 
more suggestions.*

 

 )

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
>

[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: (was: 
1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png)

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Comment: was deleted

(was: The index format of IVFFlat is organized as follows, 
!1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png!

In general, the number of centroids lies within the interval [4 * sqrt(N), 16 * 
sqrt(N)], where N is the data set size. We use (4 * sqrt(N)), denoted by c, as 
the actual centroid count to balance accuracy and computational load. The full 
data set is used for training if its size is no larger than 200,000; otherwise 
(128 * c) points are selected after shuffling in order to accelerate training.

Experiments have been conducted on a large data set (sift1M, 
[http://corpus-texmex.irisa.fr/]) to verify the implementation of IVFFlat. The 
base data set (sift_base.fvecs) contains 1,000,000 vectors with 128 dimensions. 
10,000 queries (sift_query.fvecs) are used for recall testing. The recall 
ratio is computed as

Recall = (returned vectors that appear in groundTruth) / (number of queries * TopK), 
where number of queries = 10,000 and TopK = 100. The results are as follows 
(single thread and single segment),

 
||nprobe||avg. search time (ms)||recall (%)||
|8|16.3827|44.24|
|16|16.5834|58.04|
|32|19.2031|71.55|
|64|24.7065|83.30|
|128|34.9165|92.03|
|256|60.5844|97.18|

The test code can be found at 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java].

 

 

 

 )

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users

[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: (was: image-2020-02-16-15-05-02-451.png)

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: image-2020-03-07-01-22-06-132.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-03-06 Thread GitBox
atris commented on a change in pull request #1294: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1294#discussion_r389038791
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
 ##
 @@ -211,6 +213,18 @@ public IndexSearcher(IndexReaderContext context, Executor 
executor) {
 assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel 
for reader" + context.reader();
 reader = context.reader();
 this.executor = executor;
+this.sliceExecutionControlPlane = executor == null ? null : 
getSliceExecutionControlPlane(executor);
+this.readerContext = context;
+leafContexts = context.leaves();
+this.leafSlices = executor == null ? null : slices(leafContexts);
+  }
+
+  // Package private for testing
+  IndexSearcher(IndexReaderContext context, Executor executor, 
SliceExecutionControlPlane sliceExecutionControlPlane) {
+assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel 
for reader" + context.reader();
+reader = context.reader();
+this.executor = executor;
+this.sliceExecutionControlPlane = executor == null ? null : 
sliceExecutionControlPlane;
 
 Review comment:
   Not sure if I understood your point. The passed in instance is the one being 
assigned to the member?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: image-2020-03-07-01-25-58-047.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png, 
> image-2020-03-07-01-25-58-047.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: image-2020-03-07-01-27-12-859.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png, 
> image-2020-03-07-01-25-58-047.png, image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053620#comment-17053620
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Based on [~simonw]'s recent comments in github, plus difficulty getting tests 
to pass consistently (apparently there are more failing tests in Elasticland), 
we should probably revert for now, at least from 8.x and 8.5 branches. I am 
tied up for the moment, but will be able to do the revert this weekend.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053623#comment-17053623
 ] 

Xin-Chun Zhang commented on LUCENE-9136:


1. My personal git branch: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat].

2. The vector format is as follows, 

!image-2020-03-07-01-25-58-047.png|width=535,height=297!

 

Structure of IVF index meta is as follows,

!image-2020-03-07-01-27-12-859.png|width=606,height=276!

 

Structure of IVF data:

!image-2020-03-07-01-22-06-132.png|width=529,height=309!

3. Ann-benchmark tool could be found in: 
[https://github.com/irvingzhang/ann-benchmarks].

Benchmark results (Single Thread, 2.5GHz * 2CPU, 16GB RAM, 
nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N is the size of the dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, 
recall: 76.8%~99.7%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583504416262-89784074-c9dc-4489-99a1-5e4b3c76e5fc.png|width=624,height=430!

 

2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, 
recall 65.8%~96.3%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583510066130-b4fbcb29-8ad7-4ff2-99ce-c52f7c27826e.png|width=679,height=468!

3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, 
recall 71.1%~99.2%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583515010497-20b74f41-72c3-48ce-a929-1cbfbd6a6423.png|width=691,height=476!
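
For concreteness, the centroid count and k-means training-sample size implied by the settings above (centroids = 4*sqrt(N), with 128 * centroids vectors sampled for training, as described earlier in this issue) work out as follows for the Sift-1M run; a sketch, not code from the branch:
{code:java}
// Sketch only: N = 1,000,000 base vectors.
static long centroids(long n) {
  return Math.round(4 * Math.sqrt(n)); // 4 * sqrt(1,000,000) = 4,000 centroids
}
// Training sample: 128 * 4,000 = 512,000 vectors fed to k-means, matching the
// "512,000 vectors used for training" figure mentioned earlier in this issue.
{code}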

 

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png, 
> image-2020-03-07-01-25-58-047.png, image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenar

[GitHub] [lucene-solr] atris commented on issue #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-03-06 Thread GitBox
atris commented on issue #1294: LUCENE-9074: Slice Allocation Control Plane For 
Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1294#issuecomment-595884480
 
 
   @jpountz Raised another iteration, please let me know your thoughts and 
comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028879
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
+private volatile boolean closed = false;
+private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
+private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
+private final IndexWriter writer;
+
+CloseableQueue(IndexWriter writer) {
+  this.writer = writer;
+}
+
+private void tryAcquire() {
+  if (permits.tryAcquire() == false) {
+throw new AlreadyClosedException("queue is closed");
+  }
+  if (closed) {
+throw new AlreadyClosedException("queue is closed");
+  }
+}
+
+boolean add(Event event) {
+  tryAcquire();
+  try {
+return queue.add(event);
+  } finally {
+permits.release();
+  }
+}
+
+void processEvents() throws IOException {
+  tryAcquire();
+  try {
+processEventsInternal();
+  }finally {
 
 Review comment:
   nit: space after `{`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028289
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
+private volatile boolean closed = false;
+private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
+private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
+private final IndexWriter writer;
+
+CloseableQueue(IndexWriter writer) {
+  this.writer = writer;
+}
+
+private void tryAcquire() {
+  if (permits.tryAcquire() == false) {
+throw new AlreadyClosedException("queue is closed");
+  }
+  if (closed) {
+throw new AlreadyClosedException("queue is closed");
+  }
+}
+
+boolean add(Event event) {
+  tryAcquire();
+  try {
+return queue.add(event);
+  } finally {
+permits.release();
+  }
+}
+
+void processEvents() throws IOException {
+  tryAcquire();
+  try {
+processEventsInternal();
+  }finally {
+permits.release();
+  }
+}
+private void processEventsInternal() throws IOException {
 
 Review comment:
   nit: add a new line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028473
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
 
 Review comment:
   I am not sure if `EventQueue` is a better name?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389029514
 
 

 ##
 File path: lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java
 ##
 @@ -3773,7 +3774,58 @@ public void testRefreshAndRollbackConcurrently() throws 
Exception {
   stopped.set(true);
   indexer.join();
   refresher.join();
+  if (w.getTragicException() != null) {
+w.getTragicException().printStackTrace();
 
 Review comment:
   I think we don't need to print the stack trace here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053640#comment-17053640
 ] 

Michael Froh commented on LUCENE-8962:
--

bq. With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself since we are talking about very small segments the 
overhead is very small. This would in turn mean that we are doing the work 
twice ie. the IW would do its normal work and might merge later etc.

Just to provide a bit more context, for the case where my team uses this 
change, we're replicating the index (think Solr master/slave) from "writers" to 
many "searchers", so we're avoiding doing the work many times.

An earlier (less invasive) approach I tried to address the small flushed 
segments problem was roughly: call commit on writer, hard link the commit files 
to another filesystem directory to "clone" the index, open an IW on that 
directory, merge small segments on the clone, let searchers replicate from the 
clone. That approach does mean that the merging work happens twice (since the 
"real" index doesn't benefit from the merge on the clone), but it doesn't 
involve any changes in Lucene.
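
A rough sketch of that clone-and-merge step, assuming the index and the clone 
directory sit on the same filesystem (so hard links work); the paths, the default 
IndexWriterConfig, and forceMerge are illustrative stand-ins rather than the actual 
implementation, which would use a merge policy targeting only the small flushed 
segments:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CloneAndMergeSketch {

  /** Hard-links the latest commit into cloneDir, then merges the clone down to maxSegments. */
  static void cloneAndMerge(Path indexDir, Path cloneDir, int maxSegments) throws IOException {
    Files.createDirectories(cloneDir);
    try (FSDirectory source = FSDirectory.open(indexDir)) {
      // Pick the most recent commit point; hard links make the "copy" nearly free,
      // but they require indexDir and cloneDir to be on the same filesystem.
      List<IndexCommit> commits = DirectoryReader.listCommits(source);
      IndexCommit latest = commits.get(commits.size() - 1);
      for (String file : latest.getFileNames()) {
        Files.createLink(cloneDir.resolve(file), indexDir.resolve(file));
      }
    }
    // Merging happens only on the clone; the "real" index is untouched.
    try (FSDirectory clone = FSDirectory.open(cloneDir);
         IndexWriter writer = new IndexWriter(clone, new IndexWriterConfig())) {
      writer.forceMerge(maxSegments);
      writer.commit();
    }
  }
}
{code}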

Maybe that less-invasive approach is a better way to address this. It's 
certainly more consistent with [~simonw]'s suggestion above.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053641#comment-17053641
 ] 

Kevin Watters commented on SOLR-13749:
--

Having a local param like method=xcjf could trigger the xcjf query parser if we 
want.  There are some complications.  Currently, XCJF benefits greatly from some 
additional configuration for that query parser to specify the field on which a 
collection has been routed.  The current join query parsers aren't defined 
by default in solrconfig.xml.  If we merge the functionality of 
these 2 query parsers, we might want to explicitly define the join query parser 
in the Solr config by default.

Additionally, there are many query parsers beyond xcjf that are really join 
query parsers.

"child" and "parent" should also be considered "join" query parsers if we want 
to fully go to a consolidated join query parser model.

We'll try to be responsive to issues on this ticket; however, I'm not sure how 
much bandwidth we will have for larger refactors related to xcjf.  My 
preference would be that we leave it as is.  This is what we were asked to 
develop and contribute back, so we'd like to keep it as close to the original 
contribution as possible.  If we collectively want to wrangle all of those join 
parsers into a single consolidated join query parser, perhaps we could track 
that as a different issue/ticket.
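
A hedged SolrJ sketch of what the cross-collection join filter looks like from 
the client side; the collection names, field names, the inner query, and the 
assumption that the parser is registered under "xcjf" are illustrative, not 
taken from the patch:

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XcjfFilterExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/localCollection").build()) {
      SolrQuery query = new SolrQuery("*:*");
      // Filter the local collection by join keys fetched from the remote collection.
      query.addFilterQuery(
          "{!xcjf collection=remoteCollection from=joinKey to=joinKey v='status:active'}");
      QueryResponse rsp = client.query(query);
      System.out.println("matched " + rsp.getResults().getNumFound() + " docs");
    }
  }
}
{code}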

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first one is the "Cross collection join filter"  (XCJF) parser. This is 
> the "Cross-collection join filter" query parser. It can do a call out to a 
> remote collection to get a set of join keys to be used as a filter against 
> the local collection.
> The second one is the Hash Range query parser that you can specify a field 
> name and a hash range, the result is that only the documents that would have 
> hashed to that range will be returned.
> This query parser will do an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is setup with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query to only match the documents that 
> contain a potential match for the documents that are in the local shard/core. 
>  
>  
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The lucene query that executes a search to get back a set of join 
> keys from a remote collection|
> |HashRangeQuery|The lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values ( required )|
> |zkHost|Optional|The connection string to be used to connect to Zookeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither of zkHost or solrUrl are specified, the local Zookeeper cluster 
> will be used. ( optional )|
> |solrUrl|Optional|The URL of the external Solr node to be queried ( optional 
> )|
> |from|Required|The join key field name in the external collection ( required 
> )|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on

[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028879
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
+private volatile boolean closed = false;
+private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
+private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
+private final IndexWriter writer;
+
+CloseableQueue(IndexWriter writer) {
+  this.writer = writer;
+}
+
+private void tryAcquire() {
+  if (permits.tryAcquire() == false) {
+throw new AlreadyClosedException("queue is closed");
+  }
+  if (closed) {
+throw new AlreadyClosedException("queue is closed");
+  }
+}
+
+boolean add(Event event) {
+  tryAcquire();
+  try {
+return queue.add(event);
+  } finally {
+permits.release();
+  }
+}
+
+void processEvents() throws IOException {
+  tryAcquire();
+  try {
+processEventsInternal();
+  }finally {
 
 Review comment:
   nit: space after `}`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14073) Fix segment look ahead NPE in CollapsingQParserPlugin

2020-03-06 Thread Joel Bernstein (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-14073:
--
Attachment: SOLR-14073.patch

> Fix segment look ahead NPE in CollapsingQParserPlugin
> -
>
> Key: SOLR-14073
> URL: https://issues.apache.org/jira/browse/SOLR-14073
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: SOLR-14073.patch, SOLR-14073.patch, SOLR-14073.patch
>
>
> The CollapsingQParserPlugin has a bug that if every segment is not visited 
> during the collect it throws an NPE. This causes the CollapsingQParserPlugin 
> to not work when used with any feature that short circuits the segments 
> during the collect. This includes using the CollapsingQParserPlugin twice in 
> the same query and the time limiting collector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053653#comment-17053653
 ] 

Lucene/Solr QA commented on SOLR-13944:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
11s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m 35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 74m 
57s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 82m 23s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-13944 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12995852/SOLR-13944.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP 
Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / c73d2c1 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/699/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/699/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix does 
> was not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> &fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullpointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: glove-100-angular.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects for those who are not familiar with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: glove-25-angular.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects for those who are not familiar with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Comment: was deleted

(was: 1. My personal git branch: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat].

2. The vector format is as follows, 

!image-2020-03-07-01-25-58-047.png|width=535,height=297!

 

Structure of IVF index meta is as follows,

!image-2020-03-07-01-27-12-859.png|width=606,height=276!

 

Structure of IVF data:

!image-2020-03-07-01-22-06-132.png|width=529,height=309!

3. The ann-benchmarks tool can be found at: 
[https://github.com/irvingzhang/ann-benchmarks].

Benchmark results (single thread, 2.5GHz * 2CPU, 16GB RAM, 
nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N is the size of the dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, 
recall: 76.8%~99.7%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583504416262-89784074-c9dc-4489-99a1-5e4b3c76e5fc.png|width=624,height=430!

 

2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, 
recall 65.8%~96.3%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583510066130-b4fbcb29-8ad7-4ff2-99ce-c52f7c27826e.png|width=679,height=468!

3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, 
recall 71.1%~99.2%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583515010497-20b74f41-72c3-48ce-a929-1cbfbd6a6423.png|width=691,height=476!

 )

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects for those who are not familiar with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
>

[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: sift-128-euclidean.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects for those who are not familiar with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053668#comment-17053668
 ] 

Xin-Chun Zhang commented on LUCENE-9136:


1. My personal git branch: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat].

2. The vector format is as follows, 

!image-2020-03-07-01-25-58-047.png|width=535,height=297!

 

Structure of IVF index meta is as follows,

!image-2020-03-07-01-27-12-859.png|width=606,height=276!

 

Structure of IVF data:

!image-2020-03-07-01-22-06-132.png|width=529,height=309!

3. The ann-benchmarks tool can be found at: 
[https://github.com/irvingzhang/ann-benchmarks].

Benchmark results (single thread, 2.5GHz * 2CPU, 16GB RAM, 
nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N is the size of the dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, 
recall: 76.8%~99.7%

!glove-25-angular.png|width=653,height=450!

 

2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, 
recall 65.8%~96.3%

!glove-100-angular.png|width=671,height=462!

3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, 
recall 71.1%~99.2%

!sift-128-euclidean.png|width=684,height=471!
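
A tiny illustrative sketch (not taken from the branch) of how the benchmark 
parameters above relate: the coarse centroid count follows the centroids=4*sqrt(N) 
rule, and recall is traded against QPS by sweeping nprobe. The dataset size below 
is the GloVe-1.2M case; everything else is just arithmetic:

{code:java}
public class IvfFlatBenchmarkParams {
  public static void main(String[] args) {
    long n = 1_200_000L;                        // dataset size N (e.g. GloVe-1.2M)
    int centroids = (int) (4 * Math.sqrt(n));   // centroids = 4*sqrt(N), here ~4381
    int[] nprobes = {8, 16, 32, 64, 128, 256};  // larger nprobe -> higher recall, lower QPS
    System.out.println("coarse centroids (nlist) = " + centroids);
    for (int nprobe : nprobes) {
      System.out.printf("nprobe=%d scans ~%.1f%% of the clusters per query%n",
          nprobe, 100.0 * nprobe / centroids);
    }
  }
}
{code}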

 

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects for those who are not familiar with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits

[jira] [Created] (LUCENE-9266) ant nightly-smoke fails due to presence of build.gradle

2020-03-06 Thread Mike Drob (Jira)
Mike Drob created LUCENE-9266:
-

 Summary: ant nightly-smoke fails due to presence of build.gradle
 Key: LUCENE-9266
 URL: https://issues.apache.org/jira/browse/LUCENE-9266
 Project: Lucene - Core
  Issue Type: Task
Reporter: Mike Drob


Seen on Jenkins - 
[https://builds.apache.org/job/Lucene-Solr-SmokeRelease-master/1617/console]

 

Reproduced locally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



  1   2   >