[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
spyk commented on pull request #380: URL: https://github.com/apache/lucene/pull/380#issuecomment-1061732060

> I'd be willing to take care of adding it, if you'd prefer. The CHANGES.txt file changes often and gets conflicts which, although trivial, can be kind of annoying. At this point it's possible we may have missed the boat for 9.1, so that might also be an argument for me figuring out the right place to put the CHANGES.txt entry, if you're ok with that.

Thanks @magibney, yes please, that'd be great if you could. I'd hate to take up your valuable time while I figure the whole process out :) But please let me know if you need anything from me.
[GitHub] [lucene] msokolov merged pull request #735: Fix: typo + +minScore.
msokolov merged pull request #735: URL: https://github.com/apache/lucene/pull/735
[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links
[ https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502990#comment-17502990 ] Robert Muir commented on LUCENE-10457:

Sorry, a symbolic link is a whole different thing than a file. It has different semantics, platform specifics, etc. A symlink to a directory is not a directory. I'm a hard -1 to randomly doing this from LuceneTestCase. It isn't just fixing a test or two; it's the ongoing cost of having confusing tests. It is literally 3 lines of code for you to do this in your symlink-using test.

> LuceneTestCase.createTempDir could randomly return symbolic links
> Key: LUCENE-10457
> URL: https://issues.apache.org/jira/browse/LUCENE-10457
> Project: Lucene - Core
> Issue Type: Task
> Components: general/test
> Reporter: Mike Drob
> Priority: Major
>
> When we are creating temporary directories to use for other Lucene functions, we could occasionally provide symbolic links instead of direct references to directories. If the system running tests doesn't support symbolic links, then we should ignore this option.
> Providing links would be useful to test scenarios, for example, where users have a symbolic link for the "current" index directory and then rotate that over time but applications still use the same link.
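For reference, the "3 lines of code" Robert mentions could look like this inside a test method of a LuceneTestCase subclass (a minimal sketch assuming the platform supports symlinks; the `java.nio.file.Files`/`Path` imports and the names are illustrative):

```java
Path realDir = createTempDir("real-index");                  // plain temp directory from the test framework
Path link = createTempDir("link-parent").resolve("current"); // path where the link itself will live
Files.createSymbolicLink(link, realDir);                     // point the tool/Directory under test at 'link'
```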
[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links
[ https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503016#comment-17503016 ] Mike Drob commented on LUCENE-10457:

> Sorry, a symbolic link is a whole different thing than a file. It has different semantics / platform specifics, etc. A symlink to a directory is not a directory.

In what ways? If you pass a symlink to a directory to {{Files.isDirectory}} it will return true. You have to specifically ask Java for {{LinkOption.NOFOLLOW_LINKS}} to get the distinction.

> I'm a hard -1 to randomly doing this from LuceneTestCase.

Like I said earlier, LuceneTestCase isn't the only place that could handle this. Would you be similarly opposed to a FilterFileSystem that does this?

> It isn't just fixing a test or two; it's the ongoing cost of having confusing tests. It is literally 3 lines of code for you to do this in your symlink-using test.

Maybe some additional context for my motivation would be helpful? We had an issue with symlinks in Solr; it wasn't a Lucene-related problem; we fixed it; everybody is happy and moved on. I started wondering what other places might be susceptible to mishandling symlinks. Instead of trying to individually search for every single place in our code that could be affected, I started looking for common areas that might yield high returns.

Switching LuceneTestCase.createTempDir to sometimes return symlinks to directories broke exactly three places in Lucene code. Those are easy to reproduce with three lines of code like you suggest. The concern is that I wouldn't have found those three places without a larger, more general change that sprayed symlinks all over the place. And without such a change we won't find similar issues in the future, which is the point of the randomized testing framework in the first place: have a robust testing framework and find issues that we don't even know exist.

For example, right now, running {{IndexSplitter}} doesn't work when {{destDir}} is a symlink to a directory. How did I know? I wasn't looking for IndexSplitter issues. Should it work? I would imagine yes. Or at the very least, it should give a better error than it currently does.

The core indexing and querying library never has to worry about symlinks, sure. But we have plenty of operational tools that interact with the filesystem as defined by users. And we don't control that.
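A quick standalone illustration of the distinction Mike describes (a sketch, not Lucene code; paths are arbitrary):

```java
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

public class SymlinkIsDirectoryDemo {
  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("real");
    Path link = Files.createSymbolicLink(dir.resolveSibling("link-" + System.nanoTime()), dir);

    // Links are followed by default, so a symlink to a directory "is" a directory:
    System.out.println(Files.isDirectory(link));                            // true
    // Only with NOFOLLOW_LINKS does Java expose the difference:
    System.out.println(Files.isDirectory(link, LinkOption.NOFOLLOW_LINKS)); // false
    System.out.println(Files.isSymbolicLink(link));                         // true
  }
}
```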
[GitHub] [lucene] magibney merged pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney merged pull request #380: URL: https://github.com/apache/lucene/pull/380
[jira] [Commented] (LUCENE-10171) Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
[ https://issues.apache.org/jira/browse/LUCENE-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503082#comment-17503082 ] ASF subversion and git services commented on LUCENE-10171:

Commit 8afec33e747ec81c2301a4b099bd26b4195a556e in lucene's branch refs/heads/main from Spyros Kapnissis [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8afec33 ]

LUCENE-10171: OpenNLPOpsFactory should directly cache DictionaryLemmatizer objects (#380)

Instead of caching dictionary strings and building multiple redundant DictionaryLemmatizer objects.

Co-authored-by: Michael Gibney

> Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
> Key: LUCENE-10171
> URL: https://issues.apache.org/jira/browse/LUCENE-10171
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 9.0, 7.7.3, 8.10
> Reporter: Spyros Kapnissis
> Priority: Major
> Time Spent: 4h
> Remaining Estimate: 0h
>
> When providing a lemmas.txt dictionary file, OpenNLPLemmatizerFilterFactory internally caches only the string form of the dictionary, not the DictionaryLemmatizer object. This results in parsing and creating a new DictionaryLemmatizer object every time OpenNLPLemmatizerFilterFactory.create() is called.
> In our case, with a large lemmas.txt file (5MB) and the OpenNLPLemmatizerFilter used in many fields across our setup and in multiple collections (we use Solr), we had several random OOM issues and generally high server load due to GC activity. After heap dump analysis we noticed a few thousand DictionaryLemmatizer instances of around 80MB each.
> By switching the caching to the DictionaryLemmatizer instead of the String, we were able to resolve these issues. I will be attaching a PR for review; please let me know of any comments.
> Thanks!
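The gist of the fix, sketched as a standalone cache (hypothetical class and field names; the actual change lives in OpenNLPOpsFactory, see PR #380): parse the dictionary once and cache the resulting DictionaryLemmatizer, instead of caching the raw dictionary text and re-parsing it on every factory call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ConcurrentHashMap;
import opennlp.tools.lemmatizer.DictionaryLemmatizer;

final class LemmatizerCache {
  // Before the fix: a Map<String, String> of raw dictionary text, re-parsed on every create().
  // After the fix: the parsed object itself is cached, once per dictionary name.
  private static final ConcurrentHashMap<String, DictionaryLemmatizer> CACHE = new ConcurrentHashMap<>();

  static DictionaryLemmatizer get(String dictName, InputStream dictInput) throws IOException {
    DictionaryLemmatizer cached = CACHE.get(dictName);
    if (cached == null) {
      cached = new DictionaryLemmatizer(dictInput); // parse once; ~80MB per instance in the report
      DictionaryLemmatizer previous = CACHE.putIfAbsent(dictName, cached);
      if (previous != null) {
        cached = previous; // another thread won the race; reuse its instance
      }
    }
    return cached;
  }
}
```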
[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on pull request #380: URL: https://github.com/apache/lucene/pull/380#issuecomment-1062061030

Merged to `main` and backported to `branch_9x`; this should be available as of release 9.1. Thanks @spyk!
[jira] [Commented] (LUCENE-10171) Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
[ https://issues.apache.org/jira/browse/LUCENE-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503097#comment-17503097 ] ASF subversion and git services commented on LUCENE-10171:

Commit a03306724664be1a5f4047d845736d85b03ddc41 in lucene's branch refs/heads/branch_9x from Spyros Kapnissis [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a033067 ]

LUCENE-10171: OpenNLPOpsFactory should directly cache DictionaryLemmatizer objects (#380)

Instead of caching dictionary strings and building multiple redundant DictionaryLemmatizer objects.

Co-authored-by: Michael Gibney
[jira] [Resolved] (LUCENE-10171) Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
[ https://issues.apache.org/jira/browse/LUCENE-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gibney resolved LUCENE-10171.

Fix Version/s: 9.1
Lucene Fields: (was: New,Patch Available)
Assignee: Michael Gibney
Resolution: Fixed
[jira] [Updated] (LUCENE-10371) Make IndexRearranger able to arrange segments into a determined order
[ https://issues.apache.org/jira/browse/LUCENE-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10371:

Fix Version/s: 9.1

> Make IndexRearranger able to arrange segments into a determined order
> Key: LUCENE-10371
> URL: https://issues.apache.org/jira/browse/LUCENE-10371
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Patrick Zhai
> Priority: Minor
> Fix For: 9.1
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Previously, when I tried to change luceneutil to let it use {{IndexRearranger}} for faster deterministic index construction, I found that even when each segment contains the same document set, the order of the segments will impact the estimated hit count (using BMW): [https://markmail.org/message/zl6zsqvbg7nwfq6w]
> At that time the discussion tended toward tolerating the small hit count difference to resolve the issue. After some time, when I discussed this issue again with [~mikemccand], we thought it might also be a good idea to add the ability to rearrange the segment order to {{IndexRearranger}}, so that we can ensure the rearranged index is truly the same each time.
[jira] [Updated] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10236:

Fix Version/s: 9.1

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/sandbox
> Reporter: Zach Chen
> Assignee: Zach Chen
> Priority: Minor
> Fix For: 9.1
> Time Spent: 6h 50m
> Remaining Estimate: 0h
>
> This is a spin-off issue from the discussion in [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick fix in CombinedFieldsQuery scoring.
> Currently CombinedFieldsQuery uses a constructed [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] object to create a MultiNormsLeafSimScorer for scoring, but the fields object may contain duplicated field-weight pairs, as it is [built from looping over fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], resulting in duplicated norms being added during the scoring calculation in MultiNormsLeafSimScorer.
> E.g. for a CombinedFieldsQuery with two fields and two values matching a particular doc:
> {code:java}
> CombinedFieldQuery query =
>     new CombinedFieldQuery.Builder()
>         .addField("field1", (float) 1.0)
>         .addField("field2", (float) 1.0)
>         .addTerm(new BytesRef("foo"))
>         .addTerm(new BytesRef("zoo"))
>         .build(); {code}
> I would imagine the scoring to be based on the following:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2)
> but the current logic would use the following for scoring:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)
>
> In addition, this differs from how MultiNormsLeafSimScorer is constructed in CombinedFieldsQuery's explain function, which [uses fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] and does not contain duplicated field-weight pairs.
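A toy sketch of why the two construction paths disagree (illustrative only, not the actual CombinedFieldQuery internals): building the field list while looping over (field, term) pairs repeats each field once per term, while a map keyed by field name holds each field exactly once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NormDupDemo {
  public static void main(String[] args) {
    String[] fields = {"field1", "field2"};
    String[] terms = {"foo", "zoo"};

    // Scoring path (before the fix): one list entry per (field, term) pair -> duplicates.
    List<String> fromFieldTerms = new ArrayList<>();
    for (String term : terms) {
      for (String field : fields) {
        fromFieldTerms.add(field);
      }
    }
    System.out.println(fromFieldTerms); // [field1, field2, field1, field2] -> norms summed twice

    // Explain path (and the fix): fieldAndWeights is keyed by field name,
    // so values() holds each field exactly once.
    Map<String, Float> fieldAndWeights = new LinkedHashMap<>();
    for (String field : fields) {
      fieldAndWeights.put(field, 1.0f);
    }
    System.out.println(fieldAndWeights.keySet()); // [field1, field2] -> each norm counted once
  }
}
```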
[jira] [Updated] (LUCENE-10339) Add require directive for jdk.management to core's module descriptor.
[ https://issues.apache.org/jira/browse/LUCENE-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10339:

Fix Version/s: 9.1
10.0 (main)

> Add require directive for jdk.management to core's module descriptor.
> Key: LUCENE-10339
> URL: https://issues.apache.org/jira/browse/LUCENE-10339
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Tomoko Uchida
> Priority: Major
> Fix For: 9.1, 10.0 (main)
> Attachments: Screenshot from 2021-12-25 18-04-55.png, Screenshot from 2021-12-25 20-11-38.png
> Time Spent: 1h
> Remaining Estimate: 0h
>
> IntelliJ IDEA warns as follows in o.a.l.u.RamUsageEstimator (see the two attached screenshots).
> This means we would need "require" or "require static" directive(s) in core's module descriptor to make RamUsageEstimator work correctly.
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822174425

## File path: lucene/core/src/java/org/apache/lucene/index/FieldInfos.java
## @@ -352,6 +352,14 @@ public FieldDimensions(int dimensionCount, int indexDimensionCount, int dimensio
     this.softDeletesFieldName = softDeletesFieldName;
   }
+
+    public void verifyFieldInfo(FieldInfo fi) {

Review comment: This method makes sure that incoming index fields are compatible with the destination index, e.g. vectors have the same dimensions and use the same similarity function. Changed access to package private. I retained the method name, `verifyFieldInfo()`, as it internally calls `verifySoftDeletedFieldName` and `verifySameSchema`, which have a similar naming style. I can change it if `assert*()` is more in line with the Lucene convention. In that case, I also think it should only return a condition and the asserts should be done by the caller. Let me know.
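A hedged sketch of the shape this check could take inside `FieldInfos.FieldNumbers`, based only on the description above (the helper signatures and the `nameToNumber` map are assumptions, not the PR's actual code):

```java
// Hypothetical outline (not the PR's actual code): validate an incoming
// field against the destination index's schema before any merge work starts.
void verifyFieldInfo(FieldInfo fi) {
  verifySoftDeletedFieldName(fi.name, fi.isSoftDeletesField());
  if (nameToNumber.containsKey(fi.name)) {
    // e.g. vectors must keep the same dimension and similarity function
    verifySameSchema(fi);
  }
}
```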
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822174617

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3121,147 +3125,265 @@ private void validateMergeReader(CodecReader leaf) {
    */
   public long addIndexes(CodecReader... readers) throws IOException {
     ensureOpen();
-
-    // long so we can detect int overflow:

Review comment: I think it was a miss. Added it back, thanks!
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822175645

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3121,147 +3125,265 @@ private void validateMergeReader(CodecReader leaf) {
    */
   public long addIndexes(CodecReader... readers) throws IOException {
     ensureOpen();
-
-    // long so we can detect int overflow:
-    long numDocs = 0;
     long seqNo;
-    try {
-      if (infoStream.isEnabled("IW")) {
-        infoStream.message("IW", "flush at addIndexes(CodecReader...)");
-      }
-      flush(false, true);
+    long numDocs = 0;
+    final int mergeTimeoutInSeconds = 600;
 
-      String mergedName = newSegmentName();
-      int numSoftDeleted = 0;
-      for (CodecReader leaf : readers) {
-        numDocs += leaf.numDocs();
+    try {
+      // Best effort up front validations
+      for (CodecReader leaf : readers) {
         validateMergeReader(leaf);
-        if (softDeletesEnabled) {
-          Bits liveDocs = leaf.getLiveDocs();
-          numSoftDeleted +=
-              PendingSoftDeletes.countSoftDeletes(
-                  DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator(
-                      config.getSoftDeletesField(), leaf),
-                  liveDocs);
+        for (FieldInfo fi : leaf.getFieldInfos()) {
+          globalFieldNumberMap.verifyFieldInfo(fi);
         }
+        numDocs += leaf.numDocs();
       }
-
-      // Best-effort up front check:
       testReserveDocs(numDocs);
 
-      final IOContext context =
-          new IOContext(
-              new MergeInfo(Math.toIntExact(numDocs), -1, false, UNBOUNDED_MAX_MERGE_SEGMENTS));
+      synchronized (this) {
+        ensureOpen();
+        if (merges.areEnabled() == false) {
+          throw new UnsupportedOperationException("Merges are disabled on current writer. " +

Review comment: Changed to AlreadyClosedException.
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822184400

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3121,147 +3125,265 @@ private void validateMergeReader(CodecReader leaf) {
    */
   public long addIndexes(CodecReader... readers) throws IOException {
     ensureOpen();
-
-    // long so we can detect int overflow:
-    long numDocs = 0;
     long seqNo;
-    try {
-      if (infoStream.isEnabled("IW")) {
-        infoStream.message("IW", "flush at addIndexes(CodecReader...)");
-      }
-      flush(false, true);
+    long numDocs = 0;
+    final int mergeTimeoutInSeconds = 600;
 
-      String mergedName = newSegmentName();
-      int numSoftDeleted = 0;
-      for (CodecReader leaf : readers) {
-        numDocs += leaf.numDocs();
+    try {
+      // Best effort up front validations
+      for (CodecReader leaf : readers) {
         validateMergeReader(leaf);
-        if (softDeletesEnabled) {
-          Bits liveDocs = leaf.getLiveDocs();
-          numSoftDeleted +=
-              PendingSoftDeletes.countSoftDeletes(
-                  DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator(
-                      config.getSoftDeletesField(), leaf),
-                  liveDocs);
+        for (FieldInfo fi : leaf.getFieldInfos()) {
+          globalFieldNumberMap.verifyFieldInfo(fi);
         }
+        numDocs += leaf.numDocs();
       }
-
-      // Best-effort up front check:
       testReserveDocs(numDocs);
 
-      final IOContext context =
-          new IOContext(
-              new MergeInfo(Math.toIntExact(numDocs), -1, false, UNBOUNDED_MAX_MERGE_SEGMENTS));
+      synchronized (this) {
+        ensureOpen();
+        if (merges.areEnabled() == false) {
+          throw new UnsupportedOperationException("Merges are disabled on current writer. " +
+              "Cannot execute addIndexes(CodecReaders...) API");
+        }
+      }
+
+      MergePolicy mergePolicy = config.getMergePolicy();
+      MergePolicy.MergeSpecification spec = mergePolicy.findMerges(Arrays.asList(readers));

Review comment: We allow cascading merges to proceed in the background by calling `maybeMerge()` at the end of `addIndexes()`. For `addIndexes(CodecReader...)`, we can only add segments to IW after the readers have gone through `SegmentMerger.merge()`. Until then, we don't have any segments to add (unlike the `addIndexes(Directory...)` variant, where segments are already provided in the input). So we wait for the first round of merges to complete, by blocking the API on the merges defined in the MergeSpec. Once these merges complete, we add segments to IW's SegmentInfos and invoke a maybeMerge(), which will cascade any merges on top of these segments in the background.
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822185980

## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
## @@ -3121,147 +3125,265 @@ private void validateMergeReader(CodecReader leaf) {
    */
   public long addIndexes(CodecReader... readers) throws IOException {
     ensureOpen();
-
-    // long so we can detect int overflow:
-    long numDocs = 0;
     long seqNo;
-    try {
-      if (infoStream.isEnabled("IW")) {
-        infoStream.message("IW", "flush at addIndexes(CodecReader...)");
-      }
-      flush(false, true);
+    long numDocs = 0;
+    final int mergeTimeoutInSeconds = 600;
 
-      String mergedName = newSegmentName();
-      int numSoftDeleted = 0;
-      for (CodecReader leaf : readers) {
-        numDocs += leaf.numDocs();
+    try {
+      // Best effort up front validations
+      for (CodecReader leaf : readers) {
         validateMergeReader(leaf);
-        if (softDeletesEnabled) {
-          Bits liveDocs = leaf.getLiveDocs();
-          numSoftDeleted +=
-              PendingSoftDeletes.countSoftDeletes(
-                  DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator(
-                      config.getSoftDeletesField(), leaf),
-                  liveDocs);
+        for (FieldInfo fi : leaf.getFieldInfos()) {
+          globalFieldNumberMap.verifyFieldInfo(fi);
         }
+        numDocs += leaf.numDocs();
      }
-
-      // Best-effort up front check:
       testReserveDocs(numDocs);
 
-      final IOContext context =
-          new IOContext(
-              new MergeInfo(Math.toIntExact(numDocs), -1, false, UNBOUNDED_MAX_MERGE_SEGMENTS));
+      synchronized (this) {
+        ensureOpen();
+        if (merges.areEnabled() == false) {
+          throw new UnsupportedOperationException("Merges are disabled on current writer. " +
+              "Cannot execute addIndexes(CodecReaders...) API");
+        }
+      }
+
+      MergePolicy mergePolicy = config.getMergePolicy();
+      MergePolicy.MergeSpecification spec = mergePolicy.findMerges(Arrays.asList(readers));
+      boolean mergesComplete = false;
+      if (spec != null && spec.merges.size() > 0) {
+        try {
+          spec.merges.forEach(addIndexesMergeSource::registerMerge);
+          mergeScheduler.merge(addIndexesMergeSource, MergeTrigger.ADD_INDEXES);
+          mergesComplete = spec.await();
+        } finally {
+          if (!mergesComplete) {
+            // nocommit -- ensure all intermediate files are deleted
+            for (MergePolicy.OneMerge merge : spec.merges) {
+              deleteNewFiles(merge.getMergeInfo().files());
+            }
+          }
+        }
+      }
 
-      // TODO: somehow we should fix this merge so it's
-      // abortable so that IW.close(false) is able to stop it
-      TrackingDirectoryWrapper trackingDir = new TrackingDirectoryWrapper(directory);
-      Codec codec = config.getCodec();
-      // We set the min version to null for now, it will be set later by SegmentMerger
-      SegmentInfo info =
-          new SegmentInfo(
-              directoryOrig,
-              Version.LATEST,
-              null,
-              mergedName,
-              -1,
-              false,
-              codec,
-              Collections.emptyMap(),
-              StringHelper.randomId(),
-              Collections.emptyMap(),
-              config.getIndexSort());
-
-      SegmentMerger merger =
-          new SegmentMerger(
-              Arrays.asList(readers), info, infoStream, trackingDir, globalFieldNumberMap, context);
+      if (mergesComplete) {
+        List<SegmentCommitInfo> infos = new ArrayList<>();
+        long totalMaxDoc = 0;
+        for (MergePolicy.OneMerge merge : spec.merges) {
+          totalMaxDoc += merge.totalMaxDoc;
+          if (merge.getMergeInfo() != null) {
+            infos.add(merge.getMergeInfo());
+          }
+        }
 
-      if (!merger.shouldMerge()) {
-        return docWriter.getNextSequenceNumber();
+        // nocommit -- add tests for this transactional behavior
+        synchronized (this) {
+          if (infos.isEmpty() == false) {
+            boolean success = false;
+            try {
+              ensureOpen();
+              // Reserve the docs, just before we update SIS:
+              reserveDocs(totalMaxDoc);
+              success = true;
+            } finally {
+              if (!success) {
+                for (SegmentCommitInfo sipc : infos) {
+                  // Safe: these files must exist
+                  deleteNewFiles(sipc.files());
+                }
+              }
+            }
+            segmentInfos.addAll(infos);
+            checkpoint();
+          }
+          seqNo = docWriter.getNextSequenceNumber();
+        }
+      } else {
+        // We should normally not reach here, as an earlier call should throw an exception.
+        throw new MergePolicy.MergeException("Could not complete merge
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822189054

## File path: lucene/core/src/java/org/apache/lucene/index/MergePolicy.java
## @@ -567,6 +605,21 @@ public abstract MergeSpecification findMerges(
       MergeTrigger mergeTrigger, SegmentInfos segmentInfos, MergeContext mergeContext)
       throws IOException;
 
+  /**
+   * Define {@link OneMerge} operations for a list of codec readers. This call is used to
+   * define merges for input readers in {@link IndexWriter#addIndexes(CodecReader...)}.
+   * Default implementation adds all readers to a single merge. This can be overridden in custom
+   * merge policies.
+   *
+   * @param readers set of readers to merge into the main index
+   */
+  public MergeSpecification findMerges(List<CodecReader> readers) throws IOException {
+    OneMerge merge = new OneMerge(readers, leaf -> new MergeReader(leaf, leaf.getLiveDocs()));

Review comment: Yes, this change modifies addIndexes to leverage `MergePolicy` and `MergeScheduler` without changing its end behavior. I am thinking of doing a follow-up PR with a merge policy that does a 1:1 addIndexes call, where the merger creates one segment for every provided reader. It would create a faster, more concurrent `addIndexes(CodecReader...)`, at the cost of deferring some merges to be done later in the background. Which, I believe, is similar to the behavior in `addIndexes(Directory...)`: all incoming segments are simply added to IW, and any merging happens later in the background.
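A rough sketch of the follow-up idea described above, written against the new `findMerges(List<CodecReader>)` hook this PR adds (the `OneMerge`/`MergeReader` construction is copied from the diff; treat this as the WIP API of the branch, not a released one): emit one `OneMerge` per incoming reader so each reader becomes its own segment.

```java
package org.apache.lucene.index; // MergeReader is package-private in the WIP branch

import java.io.IOException;
import java.util.List;

// Hypothetical follow-up policy: one OneMerge per reader, so
// addIndexes(CodecReader...) writes one segment per reader and defers
// real merging to the background, like the addIndexes(Directory...) variant.
public class OneToOneAddIndexesMergePolicy extends FilterMergePolicy {

  public OneToOneAddIndexesMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(List<CodecReader> readers) throws IOException {
    MergeSpecification spec = new MergeSpecification();
    for (CodecReader reader : readers) {
      // Same OneMerge construction as the diff's default, but over a single reader,
      // so no cross-reader merging happens up front.
      spec.add(new OneMerge(List.of(reader), leaf -> new MergeReader(leaf, leaf.getLiveDocs())));
    }
    return spec;
  }
}
```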
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822190688

## File path: lucene/core/src/java/org/apache/lucene/index/MergePolicy.java
## @@ -813,12 +866,24 @@ protected final boolean verbose(MergeContext mergeContext) {
   }
 
   static final class MergeReader {
+    final CodecReader codecReader;
     final SegmentReader reader;

Review comment: So far, I see `SegmentReader` being used only in `ReadersAndUpdates` [here](https://github.com/apache/lucene/blob/main/lucene%2Fcore%2Fsrc%2Fjava%2Forg%2Fapache%2Flucene%2Findex%2FReadersAndUpdates.java#L786-L793). I am not really sure if `createNewReaderWithLatestLiveDocs()` can also work with a CodecReader. Will look into this more.
[GitHub] [lucene] vigyasharma commented on a change in pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on a change in pull request #633: URL: https://github.com/apache/lucene/pull/633#discussion_r822195662

## File path: lucene/core/src/java/org/apache/lucene/index/NoMergePolicy.java
## @@ -39,6 +40,11 @@ public MergeSpecification findMerges(
     return null;
   }
 
+  @Override
+  public MergeSpecification findMerges(List<CodecReader> readers) throws IOException {
+    return null;

Review comment: In my current impl, addIndexes simply completes without adding anything if the API returns null. I now realize this is misleading for the user. Callers of addIndexes may not know about the configured merge policy. I should probably throw an exception if I get a null here? On similar lines, what should we do if the API call had some CodecReaders, but findMerges returns an empty spec? Should we error out? (Will add a log for this either way.) Re: letting the API return null, I see that the other variants of `findMerges` in this policy all return `null`, and all callers of findMerges seem to be doing the null checks. Personally, I prefer disallowing null return values, but I wanted to keep this in sync with the other variants.
[GitHub] [lucene] vigyasharma commented on pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on pull request #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1062441177

Thanks for reviewing this @mikemccand! I've addressed the ones that had minor changes, and am working on the larger ones.

> What remains to bring this out of WIP?

There are 2 existing tests that are breaking:

```java
- org.apache.lucene.index.TestAddIndexes.testAddIndicesWithSoftDeletes (:lucene:core)
- org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletions (:lucene:core)
```

I've been looking at `testAddIndicesWithSoftDeletes`. IW does a `rollbackInternal()` on shutdown, where it is failing on this assert: `assert pendingNumDocs.get() == segmentInfos.totalMaxDoc()`. I'm guessing my code is reserving more docs than it should. Is that the right direction? Here's a longer error stack trace:

```java
java.lang.AssertionError: pendingNumDocs 7 != 5 totalMaxDoc
  at __randomizedtesting.SeedInfo.seed([63C0554849BFDEC5:4F94AA0CFCBAFADD]:0)
  at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2428)
  at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1334)
  at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1362)
  at org.apache.lucene.util.IOUtils.close(IOUtils.java:85)
  at org.apache.lucene.util.IOUtils.close(IOUtils.java:72)
  at org.apache.lucene.index.TestAddIndexes.testAddIndicesWithSoftDeletes(TestAddIndexes.java:1521)
```

Outside of fixing existing tests, I want to add some new tests, especially around cases where one of the merges fails, causing the whole API to fail (i.e., test that transactionality was retained).
[jira] [Commented] (LUCENE-10448) MergeRateLimiter doesn't always limit instant rate.
[ https://issues.apache.org/jira/browse/LUCENE-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503238#comment-17503238 ] kkewwei commented on LUCENE-10448:

In our product, the instant write rate is 200MB/s in a few cases (the statistics are collected by the method above, so there should be no problem), which is far more than the limited rate. In that case, we should pause 10ms+, but the pause is ignored. [~jpountz] [~vigyas], I'm very much looking forward to your confirmation; if you think it is not a bug, I will close the issue.

> MergeRateLimiter doesn't always limit instant rate.
> Key: LUCENE-10448
> URL: https://issues.apache.org/jira/browse/LUCENE-10448
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/other
> Affects Versions: 8.11.1
> Reporter: kkewwei
> Priority: Major
>
> We can see the code in *MergeRateLimiter*:
> {code:java}
> private long maybePause(long bytes, long curNS) throws MergePolicy.MergeAbortedException {
>   double rate = mbPerSec;
>   double secondsToPause = (bytes / 1024. / 1024.) / rate;
>   long targetNS = lastNS + (long) (1000000000 * secondsToPause);
>   long curPauseNS = targetNS - curNS;
>   // We don't bother with thread pausing if the pause is smaller than 2 msec.
>   if (curPauseNS <= MIN_PAUSE_NS) {
>     // Set to curNS, not targetNS, to enforce the instant rate, not
>     // the "averaged over all history" rate:
>     lastNS = curNS;
>     return -1;
>   }
>   ..
> }
> {code}
> If a segment is being merged and *maybePause* is called at 7:00 (lastNS=7:00), then *maybePause* is called again at 7:05, so the value of *targetNS = lastNS + (long) (1000000000 * secondsToPause)* must be smaller than *curNS*; no matter how big bytes is, we will return -1 and skip the pause.
> I counted the total number of calls to *maybePause* (callTimes), the number of ignored pauses (ignorePauseTimes), and the ignored bytes in detail (detailBytes):
> {code:java}
> [2022-03-02T15:16:51,972][DEBUG][o.e.i.e.I.EngineMergeScheduler] [node1] [index1][21] merge segment [_4h] done: took [26.8s], [123.6 MB], [61,219 docs], [0s stopped], [24.4s throttled], [242.5 MB written], [11.2 MB/sec throttle], [callTimes=857], [ignorePauseTimes=25], [detailBytes(mb) = [0.28899956, 0.28140354, 0.28015518, 0.27990818, 0.2801447, 0.27991104, 0.27990723, 0.27990913, 0.2799101, 0.28010082, 0.2799921, 0.2799673, 0.28144264, 0.27991295, 0.27990818, 0.27993107, 0.2799387, 0.27998447, 0.28002167, 0.27992058, 0.27998066, 0.28098202, 0.28125, 0.28125, 0.28125]]
> {code}
> There are 857 calls to *maybePause*, of which 25 ignored the pause; we can see that the ignored byte counts (such as 0.28125mb) are not small.
> As long as the interval between two *maybePause* calls is relatively long, the pause that should be executed will not be executed.
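A standalone sketch of the arithmetic being described, using the numbers from the report (an 11.2 MB/sec throttle and ~0.28 MB writes): because targetNS is anchored to lastNS, a long gap since the previous call swallows the pause entirely.

```java
public class MaybePauseDemo {
  static final long MIN_PAUSE_NS = 2_000_000; // 2 msec, as in MergeRateLimiter

  // Simplified copy of the maybePause() logic quoted above (no abort handling).
  static long maybePause(long bytes, long curNS, long lastNS, double mbPerSec) {
    double secondsToPause = (bytes / 1024. / 1024.) / mbPerSec;
    long targetNS = lastNS + (long) (1_000_000_000 * secondsToPause);
    long curPauseNS = targetNS - curNS;
    return (curPauseNS <= MIN_PAUSE_NS) ? -1 : curPauseNS;
  }

  public static void main(String[] args) {
    long bytes = (long) (0.28 * 1024 * 1024); // ~0.28 MB written in one call
    double rate = 11.2;                       // MB/sec throttle
    // secondsToPause = 0.28 / 11.2 = 0.025 s, so targetNS = lastNS + 25,000,000 ns.

    // Calls 1 ms apart: a ~24 ms pause is correctly requested.
    System.out.println(maybePause(bytes, 1_000_000L, 0L, rate));     // ~24,000,000

    // Same write, but 5 seconds since the last call: targetNS is far behind
    // curNS, so the method returns -1 and no pause happens at all, even if
    // this single write's instant rate exceeded the limit.
    System.out.println(maybePause(bytes, 5_000_000_000L, 0L, rate)); // -1
  }
}
```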
[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links
[ https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503290#comment-17503290 ] Robert Muir commented on LUCENE-10457:

{quote}
In what ways? If you pass a symlink to a directory to Files.isDirectory it will return true. You have to specifically ask Java for LinkOption.NOFOLLOW_LINKS to get the distinction.
{quote}

I don't care what java does or doesn't do. It is so different at the OS level, and we don't use it, so I don't want it confusing our tests. You can start by reading the relevant manpages in posix or linux to get an idea of what kinds of issues you might run into: with any program, no need to think about java. These kinds of issues always stand out: I'm literally just pasting summaries of the shit listed in {{symlink(4)}}, which is just the introduction.

* can contain relative pointers
* can point to nothing at all (dangling link)
* interact with DAC permissions in strange ways
* have zero error checking when creating them (makes dangling links easy)
* interact with file deletion in strange ways

I'm not even gonna think about lower-level OS issues such as synchronization to disk, where things always get hairy. Then on the java side, we can imagine how all this crap translates:

* stuff like "getting FileStore to a symlink" will probably never work correctly.
* stupid methods like isReadable/isWritable might be buggy.
* dangling links may cause chaos across the APIs.

Sounds like you used symlinks without considering these issues and you got the punishment you deserved :) Instead of thinking "whoah, we got bit because we made a bad design, let's rethink that", now you are trying to insert symlink tests into all of lucene's tests (including stuff like build tools). And randomly!

I'm -1 because I know symlinks are different, I know the problems of symlinks, and I have suffered through them before. There is no need for us to do this because Solr made a bad decision.
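A small standalone demonstration of two items from the list above, the lack of creation-time error checking and dangling links (a sketch; paths are arbitrary):

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class DanglingLinkDemo {
  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("demo");
    Path target = dir.resolve("does-not-exist");
    Path link = dir.resolve("dangling");

    // No error checking at creation time: linking to a nonexistent target succeeds.
    Files.createSymbolicLink(link, target);

    System.out.println(Files.exists(link));         // false: exists() follows the link
    System.out.println(Files.isSymbolicLink(link)); // true: the link itself is there
    try {
      Files.readAllBytes(link);                     // following the link fails
    } catch (NoSuchFileException expected) {
      System.out.println("read failed: " + expected.getMessage());
    }
  }
}
```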
[jira] [Commented] (LUCENE-10457) LuceneTestCase.createTempDir could randomly return symbolic links
[ https://issues.apache.org/jira/browse/LUCENE-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503322#comment-17503322 ] Dawid Weiss commented on LUCENE-10457:

On the one hand I'm with Mike – it should really work, symlink in path or not. On the other hand, I confirm what Robert says: on Windows, links in the file system are even more complicated (try googling for reparse points and junctions)... It may be very difficult to make this randomized. Perhaps an explicit method createTempDirOrLink() would be better; it can be used explicitly from tests that cover tools and utilities (and not the core).
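A rough sketch of what such a helper might look like inside LuceneTestCase (hypothetical: neither the method nor its fallback behavior exists today; it builds on the existing createTempDir()):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: returns a symlink to a fresh temp dir when the
// platform supports it, otherwise falls back to the plain directory.
public static Path createTempDirOrLink(String prefix) {
  Path dir = createTempDir(prefix);
  try {
    Path link = createTempDir(prefix + "-linkparent").resolve("link");
    return Files.createSymbolicLink(link, dir);
  } catch (IOException | UnsupportedOperationException e) {
    // e.g. Windows without the required privilege: fall back silently.
    return dir;
  }
}
```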
[GitHub] [lucene] mogui commented on a change in pull request #679: Monitor Improvements LUCENE-10422
mogui commented on a change in pull request #679: URL: https://github.com/apache/lucene/pull/679#discussion_r822356271

## File path: lucene/monitor/src/java/org/apache/lucene/monitor/ReadonlyQueryIndex.java
## @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.monitor;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.SearcherManager;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+
+class ReadonlyQueryIndex extends QueryIndex {
+
+  public ReadonlyQueryIndex(MonitorConfiguration configuration) throws IOException {
+    if (configuration.getDirectoryProvider() == null) {
+      throw new IllegalStateException(
+          "You must specify a Directory when configuring a Monitor as read-only.");
+    }
+    Directory directory = configuration.getDirectoryProvider().get();
+    this.queries = new HashMap<>();
+    this.manager = new SearcherManager(directory, new TermsHashBuilder(termFilters));
+    this.decomposer = configuration.getQueryDecomposer();
+    this.serializer = configuration.getQuerySerializer();
+    this.populateQueryCache(serializer, decomposer);
+  }
+
+  @Override
+  public void commit(List<MonitorQuery> updates) throws IOException {
+    throw new IllegalStateException("Monitor is readOnly cannot commit");
+  }
+
+  @Override
+  long search(final Query query, QueryCollector matcher) throws IOException {
+    QueryBuilder builder = termFilter -> query;
+    return search(builder, matcher);
+  }
+
+  @Override
+  public long search(QueryBuilder queryBuilder, QueryCollector matcher) throws IOException {
+    IndexSearcher searcher = null;
+    try {
+      searcher = manager.acquire();
+      return searchInMemory(queryBuilder, matcher, searcher, this.queries);
+    } finally {
+      if (searcher != null) {
+        manager.release(searcher);
+      }
+    }
+  }
+
+  @Override
+  public void purgeCache() throws IOException {
+    this.populateQueryCache(serializer, decomposer);
+    lastPurged = System.nanoTime();
+  }
+
+  @Override
+  void purgeCache(CachePopulator populator) throws IOException {
+    manager.maybeRefresh();

Review comment: @romseygeek did you have a chance to look at the new commits?