[GitHub] [lucene] iverase opened a new pull request #72: LUCENE-9907: Remove packedInts#getReaderNoHeader dependency on TermsVectorFieldsFormat

2021-04-08 Thread GitBox


iverase opened a new pull request #72:
URL: https://github.com/apache/lucene/pull/72


   Replaces the usages of PackedInts#getReaderNoHeader with 
DirectReader#getInstance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317121#comment-17317121
 ] 

Robert Muir commented on LUCENE-9914:
-

FYI: for jflex we want the Unicode version to match what the rest of the 
jflex grammar is using. Sometimes new Unicode versions have features that 
require new jflex versions.

So we may want to add something like the following to the script to make it 
clear what version it was generated with:

{code}
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.util.VersionInfo;

System.out.println("// Unicode Version: " + UCharacter.getUnicodeVersion());
System.out.println("// ICU Version: " + VersionInfo.ICU_VERSION);
{code}


> Modernize Emoji regeneration scripts
> 
>
> Key: LUCENE-9914
> URL: https://issues.apache.org/jira/browse/LUCENE-9914
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>
> These are perl scripts... I don't think they had ant tasks in 8x and they 
> haven't been used in a while. They don't seem too scary (for perl) - just 
> fetch emoji unicode descriptions and parse them into a jflex macro and a test 
> case.
> It'd be good to convert them to use python, groovy or even java so that they 
> fit better in the build system. Alternatively - perhaps there is a way to get 
> these codepoint properties from Java directly?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-9913) TestCompressingTermVectorsFormat.testMergeStability can fail assertion

2021-04-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317137#comment-17317137
 ] 

Robert Muir commented on LUCENE-9913:
-

[~julietibs] I haven't dug into this test/failure yet, but it might be due to 
LUCENE-9827 merge compression changes to stored fields & vectors. 

The maxChunkSize parameter passed to the compression is now used as part of the 
decision about whether or not recompression happens at merge, and it wasn't 
used here before. So perhaps it confuses tests depending on various parameters.

> TestCompressingTermVectorsFormat.testMergeStability can fail assertion
> --
>
> Key: LUCENE-9913
> URL: https://issues.apache.org/jira/browse/LUCENE-9913
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Major
>
> This reproduces for me on {{main}}:
> {code:java}
> ./gradlew test --tests TestCompressingTermVectorsFormat.testMergeStability \
>   -Dtests.seed=502C0E17C8769082 -Dtests.nightly=true \
>   -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=gd-GB \
>   -Dtests.timezone=Africa/Accra -Dtests.asserts=true \
>   -Dtests.file.encoding=UTF-8
> {code}
> Failure excerpt:
> {code:java}
>  > java.lang.AssertionError: expected:<{tvd=33526, fnm=698, nvm=283, 
> tvm=164, tmd=826, fdm=158, pos=10508, fdt=1121, tvx=339, doc=13302, 
> tim=22354, tip=101, fdx=202, nvd=18983}> but was:<{tvd=33526, fnm=698, 
> nvm=283, tvm=163, tmd=826, fdm=157, pos=10508, fdt=1121, tvx=339, doc=13302, 
> tim=22354, tip=101, fdx=202, nvd=18983}>
>> at 
> __randomizedtesting.SeedInfo.seed([502C0E17C8769082:24604838C59C9234]:0)
>> at org.junit.Assert.fail(Assert.java:89)
> {code}






[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-08 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317138#comment-17317138
 ] 

Uwe Schindler commented on LUCENE-9914:
---

See the groovy script that creates the file UnicodeData.java. Dawid touched it 
a few days ago.

https://github.com/apache/lucene/blob/fbf9191abf2ad4acd26bae16e075cdeb79d33a39/gradle/generation/unicode-data.gradle

Uwe







[GitHub] [lucene] rmuir merged pull request #70: LUCENE-9911: enable ecjLint unusedExceptionParameter

2021-04-08 Thread GitBox


rmuir merged pull request #70:
URL: https://github.com/apache/lucene/pull/70


   





[jira] [Commented] (LUCENE-9911) enable ecjLint unusedExceptionParameter

2021-04-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317140#comment-17317140
 ] 

ASF subversion and git services commented on LUCENE-9911:
-

Commit 2971f311a2b4a9139e3a74edbe76b08bc0e288a3 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2971f31 ]

LUCENE-9911: enable ecjLint unusedExceptionParameter (#70)

Fails the linter if an exception is swallowed (e.g. variable completely
unused).

If this is intentional for some reason, the exception can simply be
annotated with @SuppressWarnings("unused").
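
As an illustration (the class and method names below are made up, not from the
Lucene codebase), this is the pattern the lint rule targets and the annotation
escape hatch the commit message describes:

```java
public class Swallow {
  // ecjLint's unusedExceptionParameter flags a catch block whose exception
  // variable is never referenced. When swallowing is intentional, annotating
  // the catch parameter documents that and silences the linter.
  static int parseOrDefault(String s, int dflt) {
    try {
      return Integer.parseInt(s);
    } catch (@SuppressWarnings("unused") NumberFormatException e) {
      return dflt; // exception intentionally ignored
    }
  }

  public static void main(String[] args) {
    System.out.println(parseOrDefault("42", 0));
    System.out.println(parseOrDefault("nope", 7));
  }
}
```

Without the annotation (and with the check enabled), the unused `e` would fail
the build.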

> enable ecjLint unusedExceptionParameter
> ---
>
> Key: LUCENE-9911
> URL: https://issues.apache.org/jira/browse/LUCENE-9911
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> unusedExceptionParameter is a very useful check, as it detects if you catch 
> an exception and do nothing with it at all.
> As a library, it's important to preserve exceptions (e.g. chain the root 
> cause, .addSuppressed, etc.). This check helps prevent exceptions from getting 
> swallowed inadvertently.






[jira] [Resolved] (LUCENE-9911) enable ecjLint unusedExceptionParameter

2021-04-08 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9911.
-
Fix Version/s: main (9.0)
   Resolution: Fixed







[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317145#comment-17317145
 ] 

Robert Muir commented on LUCENE-9914:
-

Yes, that one looks great: I think a similar groovy script can work here (using 
the above snippet). We just have to use ICU 62 for now so that we get Unicode 11 
property data to match the version of Unicode that the jflex grammar uses (I 
think it only makes sense for the whole grammar to be self-consistent with 
respect to that; we shouldn't mix and match).

FYI, that one could also be done in a similar, more efficient way with UnicodeSet 
on the "White_Space" property, rather than looping through every codepoint. 
But maybe it is fast enough that no one cares :)
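
To make the trade-off concrete, here is a minimal plain-Java sketch of the
scan-every-codepoint approach that a UnicodeSet lookup would avoid. It uses
java.lang.Character instead of ICU, so the property data will not match ICU 62 /
Unicode 11 exactly; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class WhitespaceRanges {
  // Collect all codepoints matching a property into [start, end] ranges by
  // scanning every codepoint -- the straightforward but slower approach.
  // ICU's UnicodeSet (e.g. new UnicodeSet("[:White_Space:]")) yields the
  // ranges directly from property data instead.
  public static List<int[]> toRanges(IntPredicate prop) {
    List<int[]> ranges = new ArrayList<>();
    int start = -1;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      if (prop.test(cp)) {
        if (start < 0) start = cp; // open a new range
      } else if (start >= 0) {
        ranges.add(new int[] {start, cp - 1}); // close the current range
        start = -1;
      }
    }
    if (start >= 0) ranges.add(new int[] {start, Character.MAX_CODE_POINT});
    return ranges;
  }

  public static void main(String[] args) {
    List<int[]> ws = toRanges(Character::isWhitespace);
    // The first whitespace range is U+0009..U+000D (tab through CR).
    System.out.printf("%d ranges; first = U+%04X..U+%04X%n",
        ws.size(), ws.get(0)[0], ws.get(0)[1]);
  }
}
```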







[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317153#comment-17317153
 ] 

Dawid Weiss commented on LUCENE-9914:
-

It does get a bit more complex if we want to use multiple ICU versions -- there 
can be only one referenced directly from within build scripts. Having multiple 
versions requires a separate configuration/dependency and a forked Java process 
with a different classpath. Not terribly difficult, but it definitely adds a 
layer of complexity. I'll take a look.
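
A minimal sketch of that setup (hypothetical names throughout -- this is not the
actual Lucene build code): a dedicated dependency configuration pins the second
ICU version, and a JavaExec task runs the generator in a forked JVM whose
classpath contains only that configuration:

```groovy
// Hypothetical Gradle fragment: keep ICU 62 off the main build classpath
// by isolating it in its own configuration.
configurations {
  icu62
}

dependencies {
  icu62 "com.ibm.icu:icu4j:62.2" // carries Unicode 11 property data
}

tasks.register("regenerateEmoji", JavaExec) {
  classpath = configurations.icu62
  mainClass = "GenerateEmojiProps" // illustrative generator class
  // The forked JVM sees only ICU 62, regardless of which ICU version
  // the build scripts themselves reference.
}
```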







[GitHub] [lucene] dweiss commented on pull request #71: LUCENE-9651: Make benchmarks run again, correct javadocs

2021-04-08 Thread GitBox


dweiss commented on pull request #71:
URL: https://github.com/apache/lucene/pull/71#issuecomment-815782744


   Thanks Robert. I'll go through these benchmark files and correct them so 
that they work. It is a bit worrying that nobody noticed they're broken. :) 
Is anybody using these at all?





[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-08 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317155#comment-17317155
 ] 

Uwe Schindler commented on LUCENE-9914:
---

That's really fast. The reason I did it like that was that ICU should not be a 
runtime dependency, so the script just extracts the data and provides it to 
CharTokenizer as a Bits interface (backed by a sparse bitset). It only takes 
milliseconds. 😜

Maybe we can just extend the UnicodeData class to contain Emoji codepoints in a 
similar way and let the jflex code depend on it.

Because of my bad experience with the domain name tokenizer, I tend to think 
that the FSA should only contain some "best guess" like Unicode ranges, so the 
FSA stays small. In the jflex callback the exact emoji lookup could be done, and 
everything that is not an emoji handed back to jflex as no match.

IMHO the domain name standard tokenizer should maybe be handled similarly: just 
match anything that looks like a domain and do a separate check on possible 
matches.







[jira] [Created] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9916:
---

 Summary: generateUnicodeProps doesn't work according to 
instructions, always SKIPPED
 Key: LUCENE-9916
 URL: https://issues.apache.org/jira/browse/LUCENE-9916
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


I tried to regenerate the unicode properties mentioned by [~uschindler] in 
LUCENE-9914, and I simply can't get it to run at all.

It says in the output file:
{code}
// DO NOT EDIT THIS FILE! Use "gradlew generateUnicodeProps tidy" to recreate.
{code}

Here is what I see:
{noformat}
./gradlew clean
../gradlew generateUnicodeProps tidy

...
Task :lucene:analysis:common:generateUnicodeProps SKIPPED
{noformat}

Even if I remove the output file completely: {{rm 
lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java}},
 the task is always skipped. How to regenerate? cc [~dweiss]







[jira] [Commented] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317159#comment-17317159
 ] 

Dawid Weiss commented on LUCENE-9916:
-

Run with --rerun-tasks if you want to force regeneration. It should still run 
if you remove (or touch) one of the inputs/outputs - I'll take a look.

> generateUnicodeProps doesn't work according to instructions, always SKIPPED
> ---
>
> Key: LUCENE-9916
> URL: https://issues.apache.org/jira/browse/LUCENE-9916
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> I tried to regenerate unicode properties mentioned in LUCENE-9914 by 
> [~uschindler] and I simply can't get it to run at all.
> It says in the output file:
> {code}
> // DO NOT EDIT THIS FILE! Use "gradlew generateUnicodeProps tidy" to recreate.
> {code}
> Here is what I see:
> {noformat}
> ./gradlew clean
> ../gradlew generateUnicodeProps tidy
> ...
> Task :lucene:analysis:common:generateUnicodeProps SKIPPED
> {noformat}
> Even if i remove the output file completely: {{rm 
> lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java}},
>  the task is always skipped. How to regenerate? cc [~dweiss]






[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609667950



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   @rmuir instance type swap did the trick. Thanks so much! Grabbing your 
command from above, here's the suspicious difference (in case this helps in the 
future):
   * instance type: m4.10xlarge
   ```
   m4.10xl% sudo journalctl -k | grep PMU
   Mar 24 15:02:23 localhost kernel: Performance Events: unsupported p6 CPU 
model 63 no PMU driver, software events only.
   Mar 24 15:02:23 localhost kernel: RAPL PMU: API unit is 2^-32 Joules, 0 
fixed counters, 655360 ms ovfl timer
   ```
   * instance type: m5.12xlarge
   ```
   m5.12xl% sudo journalctl -k | grep PMU
   Apr 08 04:02:43 localhost kernel: Performance Events: Skylake events, Intel 
PMU driver.
   Apr 08 04:02:43 localhost kernel: RAPL PMU: API unit is 2^-32 Joules, 0 
fixed counters, 10737418240 ms ovfl timer
   ```
   I've confirmed `perfasm` is working for me. I'll get some results updated 
here shortly. Thanks again!
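
For readers skimming past the perf tooling: the routine under review fills a
block with a prefix sum in which every delta equals the same constant. A
standalone sketch of the same arithmetic (plain Java; the hard-coded
BLOCK_SIZE of 128 mirrors Lucene's ForUtil, but nothing else here is the actual
Lucene code):

```java
public class PrefixSum {
  static final int BLOCK_SIZE = 128; // matches Lucene's ForUtil block size

  // All deltas equal val, so each prefix sum collapses to a closed form:
  // longs[i] = base + (i + 1) * val -- the same shortcut as prefixSumOf.
  static void prefixSumOf(long[] longs, long base, long val) {
    for (int i = 0; i < BLOCK_SIZE; i++) {
      longs[i] = (i + 1) * val + base;
    }
  }

  public static void main(String[] args) {
    long[] longs = new long[BLOCK_SIZE];
    prefixSumOf(longs, 10, 3);
    // first value = base + val, last value = base + BLOCK_SIZE * val
    System.out.println(longs[0] + " .. " + longs[BLOCK_SIZE - 1]); // 13 .. 394
  }
}
```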







[jira] [Commented] (LUCENE-9872) Make the most painful tasks in regenerate fully incremental

2021-04-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317169#comment-17317169
 ] 

ASF subversion and git services commented on LUCENE-9872:
-

Commit 4c2384a1f352094a2f208dd354240f56e782da1d in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4c2384a ]

LUCENE-9872: load input/output checksums prior to executing the target task, 
even if regenerate is not called.


> Make the most painful tasks in regenerate fully incremental
> ---
>
> Key: LUCENE-9872
> URL: https://issues.apache.org/jira/browse/LUCENE-9872
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: main (9.0)
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is particularly important for that one jflex task that is currently 
> mood-killer.






[jira] [Commented] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317173#comment-17317173
 ] 

Dawid Weiss commented on LUCENE-9916:
-

I corrected the checksum configuration code - it was a bug. There's, sadly, a lot 
of trickery involved in making these "checksums" work, because gradle task 
dependencies are much more relaxed than ant's - they can execute out of order 
and there is no mechanism to "skip" a task AND its dependencies (which makes 
sense, since it's a graph and task A's dependencies can also be a non-ignored 
task B's dependencies...).

I don't know of a simpler way to do it though.







[jira] [Resolved] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9916.
-
Resolution: Fixed







[jira] [Commented] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317175#comment-17317175
 ] 

Dawid Weiss commented on LUCENE-9916:
-

The "convention" to force-run any task(s) is to pass --rerun-tasks, by the way. 
The regeneration code does respect it. It's not a selective option (it'll 
really rerun everything).







[GitHub] [lucene] rmuir commented on pull request #71: LUCENE-9651: Make benchmarks run again, correct javadocs

2021-04-08 Thread GitBox


rmuir commented on pull request #71:
URL: https://github.com/apache/lucene/pull/71#issuecomment-815809836


   > Thanks Robert. I'll go through these benchmark files and correct them so 
that they work. It is a bit worrying that nobody noticed they're broken. :) 
Anybody using these at all?
   
   I've not used this mechanism of the benchmark module to do any performance 
benchmarking: it seems most performance benchmarking from 
contributors/committers uses https://github.com/mikemccand/luceneutil, or 
ad-hoc benchmarks.
   
   Personally, I use this benchmarking package, but via QualityRun's main 
method, to measure relevance, and I always write my own parser (because every 
trec-like dataset differs oh-so-slightly and the generic TREC parser we supply 
never works), and I use it in a minimal way (generate submission.txt, then run 
trec_eval etc. from the command line myself).
   
   The reason it isn't used might be the dataset: I'm unfamiliar with this 
Reuters dataset and maybe it's not big enough for useful benchmarks? I think in 
general people tend to use these datasets for performance benchmarks, often 
ad-hoc:
   * wikipedia english
   * geonames
   * apache httpd logs
   * NYC Taxis
   * OpenStreetMap
   
   Or maybe it's just because perf issues are usually complicated? For example, 
to reproduce LUCENE-9827 I downloaded geonames and wrote a simple standalone 
.java indexer (attached to the issue) that essentially changes IW's config (flush 
every doc, SerialMergeScheduler, LZ4 and DEFLATE codec compression) to keep the 
measurement simple, using only a single thread. It ran so slowly I had to limit 
the number of docs to the first N as well.
   





[jira] [Commented] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317192#comment-17317192
 ] 

Robert Muir commented on LUCENE-9916:
-

[~dweiss] I realize this looks like just a generic gradle feature, but I didn't 
know about it. Maybe a good one for help/ ?

The two cases where it would have been useful to me so far are:
1. re-running a test with the same seed.
2. trying to force-regenerate content here.

So maybe at least it could be a little one-liner in help/tests.txt and possibly 
a future help/regeneration.txt. I can follow up with it.







[GitHub] [lucene] rmuir commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


rmuir commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609691366



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   @gsmiller glad to hear you are up and running! We need more eyes on this 
stuff, and they don't exactly make it easy!
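   For readers following along, the two special-case decode paths quoted in the diff above can be sketched as a self-contained class. This is a simplified illustration, not the actual Lucene `PForUtil`; `BLOCK_SIZE` and the `IDENTITY_PLUS_ONE` table are reproduced here standalone so the snippet compiles on its own:

   ```java
   // Standalone sketch of the special-case prefix-sum paths under discussion.
   // Not Lucene's real PForUtil; a hypothetical reduction for illustration.
   public class PrefixSumSketch {
     static final int BLOCK_SIZE = 128; // mirrors ForUtil.BLOCK_SIZE

     // Precomputed table holding 1, 2, ..., 128, mirroring IDENTITY_PLUS_ONE.
     static final long[] IDENTITY_PLUS_ONE = new long[BLOCK_SIZE];

     static {
       for (int i = 0; i < BLOCK_SIZE; i++) {
         IDENTITY_PLUS_ONE[i] = i + 1;
       }
     }

     // All deltas == 1: prefix sums are base+1, base+2, ..., base+128.
     // Per the thread, the add loop below is the one that auto-vectorizes.
     static void prefixSumOfOnes(long[] longs, long base) {
       System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, BLOCK_SIZE);
       for (int i = 0; i < BLOCK_SIZE; ++i) {
         longs[i] += base;
       }
     }

     // All deltas == val: prefix sums are base + val, base + 2*val, ...
     static void prefixSumOf(long[] longs, long base, long val) {
       for (int i = 0; i < BLOCK_SIZE; i++) {
         longs[i] = (i + 1) * val + base;
       }
     }

     public static void main(String[] args) {
       long[] out = new long[BLOCK_SIZE];
       prefixSumOf(out, 100, 3);
       System.out.println(out[0] + " " + out[127]); // 103 484
     }
   }
   ```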




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9916) generateUnicodeProps doesn't work according to instructions, always SKIPPED

2021-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317196#comment-17317196
 ] 

Dawid Weiss commented on LUCENE-9916:
-

I think tests.txt mentions the convention cleanTest to rerun with the same seed 
(which works). I agree --rerun-tasks is sometimes useful and I use it myself. 
The "cleanTaskName" convention is more convenient if you don't want to rebuild 
the world. Please go ahead and commit a clarification to the docs - it's better 
if it comes from you than me.

> generateUnicodeProps doesn't work according to instructions, always SKIPPED
> ---
>
> Key: LUCENE-9916
> URL: https://issues.apache.org/jira/browse/LUCENE-9916
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> I tried to regenerate unicode properties mentioned in LUCENE-9914 by 
> [~uschindler] and I simply can't get it to run at all.
> It says in the output file:
> {code}
> // DO NOT EDIT THIS FILE! Use "gradlew generateUnicodeProps tidy" to recreate.
> {code}
> Here is what I see:
> {noformat}
> ./gradlew clean
> ../gradlew generateUnicodeProps tidy
> ...
> Task :lucene:analysis:common:generateUnicodeProps SKIPPED
> {noformat}
> Even if I remove the output file completely: {{rm 
> lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java}},
>  the task is always skipped. How to regenerate? cc [~dweiss]






[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609697557



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Here's what I've found with `perfasm` on this microbenchmark branch:
   1. The `prefixSumOf` method in question [1] is _not_ auto-vectorizing. The 
assembly loop is below [2].
   2. If I change the implementation of `prefixSumOf` to use two loops [3], the 
second "add" loop is auto-vectorizing in the same way that `prefixSumOfOnes` does 
[4], but the first "multiply" loop does not [5].
   3. Even though the second approach [3] gets partially vectorized, it's 
significantly less performant than the vanilla, single-loop approach [6].
   
   [1]
   ```
   private static void prefixSumOf(long val, long[] arr, long base) {
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
   }
   }
   ```
   [2]
   ```
 0.45%β†—  0x7f37dfa02c52: mov%r9,%r8
 0.30%β”‚  0x7f37dfa02c55: movabs $0xd2816340,%rdi   ;   
{oop([J{0xd2816340})}
 3.16%β”‚  0x7f37dfa02c5f: imul   0x10(%rdi,%r10,8),%r8
 1.12%β”‚  0x7f37dfa02c65: add%rcx,%r8
 1.42%β”‚  0x7f37dfa02c68: mov%r8,0x10(%rbx,%r10,8)
 2.97%β”‚  0x7f37dfa02c6d: mov%r9,%r8
 2.62%β”‚  0x7f37dfa02c70: imul   0x18(%rdi,%r10,8),%r8
 1.37%β”‚  0x7f37dfa02c76: add%rcx,%r8
 1.40%β”‚  0x7f37dfa02c79: mov%r8,0x18(%rbx,%r10,8)
 5.71%β”‚  0x7f37dfa02c7e: mov%r9,%r8
 2.02%β”‚  0x7f37dfa02c81: imul   0x20(%rdi,%r10,8),%r8
 1.08%β”‚  0x7f37dfa02c87: add%rcx,%r8
 1.91%β”‚  0x7f37dfa02c8a: mov%r8,0x20(%rbx,%r10,8)
 4.96%β”‚  0x7f37dfa02c8f: mov%r9,%r8
 1.57%β”‚  0x7f37dfa02c92: imul   0x28(%rdi,%r10,8),%r8
 0.71%β”‚  0x7f37dfa02c98: add%rcx,%r8
 0.56%β”‚  0x7f37dfa02c9b: mov%r8,0x28(%rbx,%r10,8)  ;*lastore 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@24 (line 29)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 4.79%β”‚  0x7f37dfa02ca0: add$0x4,%r10d ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@25 (line 28)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 1.22%β”‚  0x7f37dfa02ca4: cmp$0x7d,%r10d
  β•°  0x7f37dfa02ca8: jl 0x7f37dfa02c52  ;*if_icmpge 
{reexecute=0 rethrow=0 return_oop=0}
   ```
   [3]
   ```
   private static void prefixSumOfTwoLoops(long val, long[] arr, long base) 
{
   System.arraycopy(IDENTITY_PLUS_ONE, 0, arr, 0, ForUtil.BLOCK_SIZE);
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] *= val;
   }
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] += base;
   }
   }
   ```
   [4]
   ```
 0.11%   β†—0x7f1607a05810: vpaddq 0x10(%rbp,%r11,8),%ymm0,%ymm1
 0.17%   β”‚0x7f16

[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609699872



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   So, as of now, I think we leave the implementation as is and hope that 
we can do something better with more explicit vectorization support in the 
future. @jpountz / @rmuir does that seem right to you? If you have any 
suggestions on other was to try to trick this compiler, I'm happy to try them 
out. And I know you'll call it out if you see something off in my above 
analysis, since I'm so new to this :)







[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609697557



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Here's what I've found with `perfasm` on [this microbenchmark 
branch](https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-is-it-vectorizing):
   1. The `prefixSumOf` method in question [1] is _not_ auto-vectorizing. The 
assembly loop is below [2].
   2. If I change the implementation of `prefixSumOf` to use two loops [3], the 
second "add" loop is auto-vectorizing in the same way that `prefixSumOfOnes` does 
[4], but the first "multiply" loop does not [5].
   3. Even though the second approach [3] gets partially vectorized, it's 
significantly less performant than the vanilla, single-loop approach [6].
   
   [1]
   ```
   private static void prefixSumOf(long val, long[] arr, long base) {
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
   }
   }
   ```
   [2]
   ```
 0.45%β†—  0x7f37dfa02c52: mov%r9,%r8
 0.30%β”‚  0x7f37dfa02c55: movabs $0xd2816340,%rdi   ;   
{oop([J{0xd2816340})}
 3.16%β”‚  0x7f37dfa02c5f: imul   0x10(%rdi,%r10,8),%r8
 1.12%β”‚  0x7f37dfa02c65: add%rcx,%r8
 1.42%β”‚  0x7f37dfa02c68: mov%r8,0x10(%rbx,%r10,8)
 2.97%β”‚  0x7f37dfa02c6d: mov%r9,%r8
 2.62%β”‚  0x7f37dfa02c70: imul   0x18(%rdi,%r10,8),%r8
 1.37%β”‚  0x7f37dfa02c76: add%rcx,%r8
 1.40%β”‚  0x7f37dfa02c79: mov%r8,0x18(%rbx,%r10,8)
 5.71%β”‚  0x7f37dfa02c7e: mov%r9,%r8
 2.02%β”‚  0x7f37dfa02c81: imul   0x20(%rdi,%r10,8),%r8
 1.08%β”‚  0x7f37dfa02c87: add%rcx,%r8
 1.91%β”‚  0x7f37dfa02c8a: mov%r8,0x20(%rbx,%r10,8)
 4.96%β”‚  0x7f37dfa02c8f: mov%r9,%r8
 1.57%β”‚  0x7f37dfa02c92: imul   0x28(%rdi,%r10,8),%r8
 0.71%β”‚  0x7f37dfa02c98: add%rcx,%r8
 0.56%β”‚  0x7f37dfa02c9b: mov%r8,0x28(%rbx,%r10,8)  ;*lastore 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@24 (line 29)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 4.79%β”‚  0x7f37dfa02ca0: add$0x4,%r10d ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@25 (line 28)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 1.22%β”‚  0x7f37dfa02ca4: cmp$0x7d,%r10d
  β•°  0x7f37dfa02ca8: jl 0x7f37dfa02c52  ;*if_icmpge 
{reexecute=0 rethrow=0 return_oop=0}
   ```
   [3]
   ```
   private static void prefixSumOfTwoLoops(long val, long[] arr, long base) 
{
   System.arraycopy(IDENTITY_PLUS_ONE, 0, arr, 0, ForUtil.BLOCK_SIZE);
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] *= val;
   }
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] += base;
   }
   }
   ```
   [4]
   ```
 0.11%   β†—0x0

[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609697557



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Here's what I've found with `perfasm` on [this microbenchmark 
branch](https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-is-it-vectorizing):
   1. The `prefixSumOf` method in question [1] is _not_ auto-vectorizing. The 
assembly loop is below [2].
   2. If I change the implementation of `prefixSumOf` to use two loops [3], the 
second "add" loop is auto-vectorizing in the same way that `prefixSumOfOnes` does 
[4], but the first "multiply" loop does not [5].
   3. Even though the second approach [3] gets partially vectorized, it's 
significantly less performant than the vanilla, single-loop approach (7.1 
throughput vs. 6.3) [6].
   
   [1]
   ```
   private static void prefixSumOf(long val, long[] arr, long base) {
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
   }
   }
   ```
   [2]
   ```
 0.45%β†—  0x7f37dfa02c52: mov%r9,%r8
 0.30%β”‚  0x7f37dfa02c55: movabs $0xd2816340,%rdi   ;   
{oop([J{0xd2816340})}
 3.16%β”‚  0x7f37dfa02c5f: imul   0x10(%rdi,%r10,8),%r8
 1.12%β”‚  0x7f37dfa02c65: add%rcx,%r8
 1.42%β”‚  0x7f37dfa02c68: mov%r8,0x10(%rbx,%r10,8)
 2.97%β”‚  0x7f37dfa02c6d: mov%r9,%r8
 2.62%β”‚  0x7f37dfa02c70: imul   0x18(%rdi,%r10,8),%r8
 1.37%β”‚  0x7f37dfa02c76: add%rcx,%r8
 1.40%β”‚  0x7f37dfa02c79: mov%r8,0x18(%rbx,%r10,8)
 5.71%β”‚  0x7f37dfa02c7e: mov%r9,%r8
 2.02%β”‚  0x7f37dfa02c81: imul   0x20(%rdi,%r10,8),%r8
 1.08%β”‚  0x7f37dfa02c87: add%rcx,%r8
 1.91%β”‚  0x7f37dfa02c8a: mov%r8,0x20(%rbx,%r10,8)
 4.96%β”‚  0x7f37dfa02c8f: mov%r9,%r8
 1.57%β”‚  0x7f37dfa02c92: imul   0x28(%rdi,%r10,8),%r8
 0.71%β”‚  0x7f37dfa02c98: add%rcx,%r8
 0.56%β”‚  0x7f37dfa02c9b: mov%r8,0x28(%rbx,%r10,8)  ;*lastore 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@24 (line 29)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 4.79%β”‚  0x7f37dfa02ca0: add$0x4,%r10d ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@25 (line 28)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 1.22%β”‚  0x7f37dfa02ca4: cmp$0x7d,%r10d
  β•°  0x7f37dfa02ca8: jl 0x7f37dfa02c52  ;*if_icmpge 
{reexecute=0 rethrow=0 return_oop=0}
   ```
   [3]
   ```
   private static void prefixSumOfTwoLoops(long val, long[] arr, long base) 
{
   System.arraycopy(IDENTITY_PLUS_ONE, 0, arr, 0, ForUtil.BLOCK_SIZE);
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] *= val;
   }
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] += base;
   }
   }
   ```
   [4]
   ```

[GitHub] [lucene] rmuir opened a new pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


rmuir opened a new pull request #73:
URL: https://github.com/apache/lucene/pull/73


   This probably isn't the most efficient or the best, but it's a start.
   
   Some notes:
   * Using these steps to "force regenerate" results in local diffs. These look 
to be hashmap ordering differences or similar. We should fix these so that 
regeneration is fully idempotent?
   * Might not be the most efficient; for example, when using `--rerun-tasks` the tidy task is rerun even when it's not necessary, which is actually quite slow. Is the `tidy` task really necessary, or is it automatically/more efficiently done as some prerequisite of `regenerate`?





[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609697557



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Here's what I've found with `perfasm` on [this microbenchmark 
branch](https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-is-it-vectorizing):
   1. The `prefixSumOf` method in question [1] is _not_ auto-vectorizing. The 
assembly loop is below [2].
   2. If I change the implementation of `prefixSumOf` to use two loops [3], the 
second "add" loop is auto-vectorizing in the same way that `prefixSumOfOnes` does 
[4], but the first "multiply" loop does not [5].
   3. Even though the second approach [3] gets partially vectorized, it's 
significantly less performant than the vanilla, single-loop approach (7.1 
throughput vs. 6.3) [6].
   4. The full output of the jmh benchmark runs with `perfasm` are here (note 
that the first run in each of these is `prefixSumOfOnes` as a baseline, 
controlled by `sameVal == 1` instead of `sameVal == 2`; `sameVal == 1` triggers 
the special-case handling using `prefixSumOfOnes`): 
[single-loop.log](https://github.com/gsmiller/decode-128-ints-benchmark/blob/pfor-is-it-vectorizing/single-loop.log),
 
[two-loops.log](https://github.com/gsmiller/decode-128-ints-benchmark/blob/pfor-is-it-vectorizing/two-loops.log)
   
   [1]
   ```
   private static void prefixSumOf(long val, long[] arr, long base) {
   for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
   arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
   }
   }
   ```
   [2]
   ```
 0.45%β†—  0x7f37dfa02c52: mov%r9,%r8
 0.30%β”‚  0x7f37dfa02c55: movabs $0xd2816340,%rdi   ;   
{oop([J{0xd2816340})}
 3.16%β”‚  0x7f37dfa02c5f: imul   0x10(%rdi,%r10,8),%r8
 1.12%β”‚  0x7f37dfa02c65: add%rcx,%r8
 1.42%β”‚  0x7f37dfa02c68: mov%r8,0x10(%rbx,%r10,8)
 2.97%β”‚  0x7f37dfa02c6d: mov%r9,%r8
 2.62%β”‚  0x7f37dfa02c70: imul   0x18(%rdi,%r10,8),%r8
 1.37%β”‚  0x7f37dfa02c76: add%rcx,%r8
 1.40%β”‚  0x7f37dfa02c79: mov%r8,0x18(%rbx,%r10,8)
 5.71%β”‚  0x7f37dfa02c7e: mov%r9,%r8
 2.02%β”‚  0x7f37dfa02c81: imul   0x20(%rdi,%r10,8),%r8
 1.08%β”‚  0x7f37dfa02c87: add%rcx,%r8
 1.91%β”‚  0x7f37dfa02c8a: mov%r8,0x20(%rbx,%r10,8)
 4.96%β”‚  0x7f37dfa02c8f: mov%r9,%r8
 1.57%β”‚  0x7f37dfa02c92: imul   0x28(%rdi,%r10,8),%r8
 0.71%β”‚  0x7f37dfa02c98: add%rcx,%r8
 0.56%β”‚  0x7f37dfa02c9b: mov%r8,0x28(%rbx,%r10,8)  ;*lastore 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@24 (line 29)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 4.79%β”‚  0x7f37dfa02ca0: add$0x4,%r10d ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
  β”‚; - 
jpountz.PForDeltaDecoder::prefixSumOf@25 (line 28)
  β”‚; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
  β”‚; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
  β”‚; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
 1.22%β”‚  0x7f37dfa02ca4: cmp$0x7d,%r10d
  

[GitHub] [lucene] rmuir commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


rmuir commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609724812



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   I don't know enough about the auto-vectorization to trick it. But in general I wonder if we should replace some instances of `ForUtil.BLOCK_SIZE` with `array.length` etc. where possible, to make the compiler's job easier wrt bounds checks and loop processing. Arrays are being passed in as parameters, and these are private static methods, so I don't know how smart it is about this today :) On the other hand, maybe it is a waste of your time and just costs a single inexpensive check up front...







[GitHub] [lucene] jpountz commented on pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


jpountz commented on pull request #69:
URL: https://github.com/apache/lucene/pull/69#issuecomment-815845326


   > I think we leave the implementation as is and hope that we can do 
something better with more explicit vectorization support in the future
   
   (Sorry replying here as Github prevents me from replying on the existing 
thread)
   
   +1 Let's go with whichever of `arr[i] = IDENTITY_PLUS_ONE[i] * val + base` or 
`arr[i] = (i+1) * val + base` runs fastest in your micro benchmark. We can 
still improve things later if we find a way to trick the JVM into 
auto-vectorizing this loop.





[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609733110



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Interesting thought @rmuir. I'll tweak this to see what kind of 
difference it makes, but we can't replace `ForUtil.BLOCK_SIZE` with 
`array.length` in the production code. The array length is actually one more 
than `ForUtil.BLOCK_SIZE` (as used in `Lucene90PostingsReader`). (See 
[L317](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java#L317)
 for example). It populates that 129th value with `NO_MORE_DOCS` (i.e., 
`MAX_INT`) as an end-of-block marker. This was the source of a very frustrating 
debugging effort on my part while working on this, since early on I was 
actually using `array.length` instead :)







[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609733110



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Interesting thought @rmuir. I'll tweak this to see what kind of 
difference it makes, but we can't replace `ForUtil.BLOCK_SIZE` with 
`array.length` in the production code. The array length is actually one more 
than `ForUtil.BLOCK_SIZE` (as used in `Lucene90PostingsReader`). (See 
[L317](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java#L317)
 for example). It populates that 129th value with `NO_MORE_DOCS` (i.e., 
`MAX_INT`) as an end-of-block marker. This was the source of a very frustrating 
debugging effort on my part while working on this, since early on I was 
actually using `array.length` instead :)
   
   UPDATE: I tried with both `array.length - 1` (which we'd need to actually 
use in production) as well as `array.length` (just to see if it mattered) and 
didn't get any auto-vectorization. The assembly looked the same to my eye. 
Thanks for the suggestion though!







[GitHub] [lucene] rmuir commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


rmuir commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609750704



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   ok, maybe useful for the future to look at. perhaps we could "hold" 
ForUtil different from postings reader and avoid this. Or maybe you could try 
something like `array.length & ~(BLOCK_SIZE - 1)` which is similar to what 
VectorSpecies.loopBound does when writing manually vectorized code. I found it 
was quite sensitive to this stuff.
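The `array.length & ~(BLOCK_SIZE - 1)` trick mentioned above can be sketched in isolation (a standalone toy, not Lucene code; it assumes `BLOCK_SIZE` is a power of two):

```java
// Illustration of the loop-bound masking suggested above: round the length
// down to a multiple of BLOCK_SIZE with a bitmask, similar in spirit to the
// Vector API's VectorSpecies.loopBound. The main loop then covers only full
// blocks, with a scalar tail loop for the remainder.
public class LoopBoundSketch {
  static final int BLOCK_SIZE = 128;  // must be a power of two for the mask

  public static void main(String[] args) {
    int length = 300;
    int bound = length & ~(BLOCK_SIZE - 1);  // 300 & ~127 == 256
    System.out.println(bound);

    long[] array = new long[length];
    long base = 10;
    int i = 0;
    for (; i < bound; i++) {   // full-block portion
      array[i] += base;
    }
    for (; i < length; i++) {  // scalar tail
      array[i] += base;
    }
    System.out.println(array[length - 1]);
  }
}
```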







[GitHub] [lucene] uschindler commented on pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


uschindler commented on pull request #73:
URL: https://github.com/apache/lucene/pull/73#issuecomment-815870339


   I have a question: why do we need this "tidy" at end of command line? If it 
is always required, it could be triggered automatically?
   
   I know this is unrelated to the documentation issue, but whenever I see any 
of those instructions, this puts questions in my eyes: πŸ€”





[GitHub] [lucene] uschindler commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


uschindler commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609763034



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy

Review comment:
   I would indent those lines. Maybe use Markdown for whole help files?







[GitHub] [lucene] uschindler edited a comment on pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


uschindler edited a comment on pull request #73:
URL: https://github.com/apache/lucene/pull/73#issuecomment-815870339


   I have a question: why do we need this "tidy" at end of command line? If it 
is always required, it could be triggered automatically?
   
   I know this is unrelated to the documentation issue, but whenever I see any 
of those instructions, this puts questions in my eyes: πŸ€”





[GitHub] [lucene] rmuir commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


rmuir commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609772375



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy

Review comment:
   FYI I followed the style of existing help docs which do not indent, see 
tests.txt. I would say +1 to markdown as the current format is alien, and 
markdown would give good rendering on github?







[jira] [Commented] (LUCENE-9827) Small segments are slower to merge due to stored fields since 8.7

2021-04-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317245#comment-17317245
 ] 

ASF subversion and git services commented on LUCENE-9827:
-

Commit e510ef11c2a4307dd6ecc8c8974eef2c04e3e4d6 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e510ef1 ]

LUCENE-9827: Propagate `numChunks` through bulk merges.


> Small segments are slower to merge due to stored fields since 8.7
> -
>
> Key: LUCENE-9827
> URL: https://issues.apache.org/jira/browse/LUCENE-9827
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: main (9.0)
>
> Attachments: Indexer.java, log-and-lucene-9827.patch, 
> merge-count-by-num-docs.png, merge-type-by-version.png, 
> total-merge-time-by-num-docs-on-small-segments.png, 
> total-merge-time-by-num-docs.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed 
> down after upgrading to 8.7. After digging we identified that this was due to 
> the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that 
> are then split into sub-blocks and compressed using shared dictionaries (one 
> dictionary per top-level block). As the top-level blocks are larger than they 
> were before, segments are more likely to be considered "dirty" by the merging 
> logic. Dirty segments are segments where 1% of the data or more consists of 
> incomplete blocks. For large segments, the size of blocks doesn't really 
> affect the dirtiness of segments: if you flush a segment that has 100 blocks 
> or more, it will never be considered dirty as only the last block may be 
> incomplete. But for small segments it does: for instance if your segment is 
> only 10 blocks, it is very likely considered dirty given that the last block 
> is always incomplete. And the fact that we increased the top-level block size 
> means that segments that used to be considered clean might now be considered 
> dirty.
> And indeed benchmarks reported that while large stored fields merges became 
> slightly faster after upgrading to 8.7, the smaller merges actually became 
> slower. See attached chart, which gives the total merge time as a function of 
> the number of documents in the segment.
> I don't know how we can address this, this is a natural consequence of the 
> larger block size, which is needed to achieve better compression ratios. But 
> I wanted to open an issue about it in case someone has a bright idea how we 
> could make things better.
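The dirtiness arithmetic in the description above can be made concrete with a small sketch (the `isDirty` helper and the strictly-greater comparison are my reading of the description, not Lucene's actual merge logic):

```java
// Worked example of the "dirty segment" arithmetic described above: a segment
// is dirty when incomplete blocks make up more than 1% of its blocks. Only
// the last block of a flushed segment may be incomplete, so block count alone
// decides the outcome.
public class DirtySegmentExample {
  // Strictly greater than 1%: 1 incomplete block out of 100 is still "clean",
  // matching the statement that 100+ blocks are never considered dirty.
  static boolean isDirty(int totalBlocks, int incompleteBlocks) {
    return incompleteBlocks * 100 > totalBlocks;
  }

  public static void main(String[] args) {
    // Small segment: 10 blocks, last one incomplete -> 10% incomplete -> dirty.
    System.out.println(isDirty(10, 1));
    // Large segment: 100 blocks, last one incomplete -> exactly 1% -> clean.
    System.out.println(isDirty(100, 1));
  }
}
```

This is why raising the top-level block size pushes small segments across the threshold: fewer blocks per segment means one incomplete block is a larger fraction.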



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9917) Reduce block size for BEST_COMPRESSION

2021-04-08 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-9917:


 Summary: Reduce block size for BEST_COMPRESSION
 Key: LUCENE-9917
 URL: https://issues.apache.org/jira/browse/LUCENE-9917
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


As benchmarks suggested major savings and minor slowdowns with larger block 
sizes, I had increased them on LUCENE-9486. However it looks like this slowdown 
is still problematic for some users, so I plan to go back to a smaller block 
size, something like 10*16kB to get closer to the amount of data we had to 
decompress per document when we had 16kB blocks without shared dictionaries.






[jira] [Resolved] (LUCENE-9913) TestCompressingTermVectorsFormat.testMergeStability can fail assertion

2021-04-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9913.
--
Resolution: Fixed

Argh, it is a bug in bulk merges due to numChunks not being propagated on the 
optimized merge code path. I'm glad this test caught it. I just pushed a fix.

I'm not setting a fixVersion since this bug wasn't released.

> TestCompressingTermVectorsFormat.testMergeStability can fail assertion
> --
>
> Key: LUCENE-9913
> URL: https://issues.apache.org/jira/browse/LUCENE-9913
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Major
>
> This reproduces for me on {{main}}:
> {code:java}
> ./gradlew test --tests TestCompressingTermVectorsFormat.testMergeStability \
>   -Dtests.seed=502C0E17C8769082 -Dtests.nightly=true \
>   -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=gd-GB \
>   -Dtests.timezone=Africa/Accra -Dtests.asserts=true \
>   -Dtests.file.encoding=UTF-8
> {code}
> Failure excerpt:
> {code:java}
>  > java.lang.AssertionError: expected:<{tvd=33526, fnm=698, nvm=283, 
> tvm=164, tmd=826, fdm=158, pos=10508, fdt=1121, tvx=339, doc=13302, 
> tim=22354, tip=101, fdx=202, nvd=18983}> but was:<{tvd=33526, fnm=698, 
> nvm=283, tvm=163, tmd=826, fdm=157, pos=10508, fdt=1121, tvx=339, doc=13302, 
> tim=22354, tip=101, fdx=202, nvd=18983}>
>> at 
> __randomizedtesting.SeedInfo.seed([502C0E17C8769082:24604838C59C9234]:0)
>> at org.junit.Assert.fail(Assert.java:89)
> {code}






[GitHub] [lucene] dweiss commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


dweiss commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609801705



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy

Review comment:
   I hate those markup formats and live in txt world... Also, these files 
are sourced (and printed) as part of helpXXX tasks which you can invoke from 
gradlew. Don't know if this matters (I'm sure there is a plugin somewhere that 
renders them into ascii console opcodes...).







[GitHub] [lucene] dweiss commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


dweiss commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609807318



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy
+
+Force-regenerate code, even when it isn't necessary:
+
+gradlew --rerun-tasks regenerate tidy
+
+Force-regenerate code, except for one tokenizer which is extremely slow:

Review comment:
   Most regeneration tasks are incremental at the moment - they do sense if 
they need to run or not. There should be a big red "last resort" option in this 
help file because in 99% of cases this should do the job: gradlew regenerate. 
That's it. Skips over tasks that have the same inputs/ outputs, regenerates and 
tidies up everything else. I've tested it on Linux and Windows and it really 
does work. The trouble you fell into today was caused by the fact that you use 
the low-level regeneration task and regenerate has all sorts of tweaks to make 
those tasks incremental and clean up formatting, etc.
   







[GitHub] [lucene] dweiss commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


dweiss commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609808709



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy
+
+Force-regenerate code, even when it isn't necessary:
+
+gradlew --rerun-tasks regenerate tidy
+
+Force-regenerate code, except for one tokenizer which is extremely slow:

Review comment:
   An example of when --rerun-tasks is useful is when you tweak the code of 
the generation task itself (not the inputs/outputs but the task itself).







[GitHub] [lucene] uschindler commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


uschindler commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609817750



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy
+
+Force-regenerate code, even when it isn't necessary:
+
+gradlew --rerun-tasks regenerate tidy
+
+Force-regenerate code, except for one tokenizer which is extremely slow:

Review comment:
   I figured out that gradle also re-executes tasks if you change its source 
file (at least in the past this worked). I tested this at least when developing 
the renderJavadocs classes.







[GitHub] [lucene] uschindler commented on a change in pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


uschindler commented on a change in pull request #73:
URL: https://github.com/apache/lucene/pull/73#discussion_r609819216



##
File path: help/regeneration.txt
##
@@ -0,0 +1,23 @@
+Regeneration
+
+
+Lucene makes use of some generated code (e.g. jflex tokenizers).
+
+Examples below assume cwd at the gradlew script in the top directory of
+the project's checkout.
+
+
+Generic regeneration commands
+--
+
+Regenerate code:
+
+gradlew regenerate tidy

Review comment:
   Markdown is a good compromise. I just think we should keep it as minimal as 
possible, but e.g. put code parts inside `code` blocks or indent them, so they get 
blockquoted/source-formatted automatically.
   
   I don't want full featured Markdown :-)







[GitHub] [lucene] dweiss commented on pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


dweiss commented on pull request #73:
URL: https://github.com/apache/lucene/pull/73#issuecomment-815914300


   Leave this patch open, Robert. There is one more non-trivial bit (checksum 
saving) that I need to explain there - I'll do it once I get back home.





[GitHub] [lucene-solr] janhoy closed pull request #2082: SOLR-15002: Upgrade HttpClient to 4.5.13

2021-04-08 Thread GitBox


janhoy closed pull request #2082:
URL: https://github.com/apache/lucene-solr/pull/2082


   





[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609945080



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   @rmuir interesting. Yeah, I'd like to explore this further, but I wonder 
if it makes sense to do so in a follow-up Jira? For starters, intuitively, this 
case seems pretty uncommon. It will only kick in when all deltas are the same 
value, but aren't `1`. "Dense" blocks seem like the common case for using 0 
bpv, where all deltas would be `1`, and that case is definitely optimized 
already (`prefixSumOfOnes`). In fact, `ForDeltaUtil` doesn't even use 0 bpv for 
any case other than `1` (it doesn't actually store the "same value", but rather 
infers that it's `1` if bpv == 0). So this is already more efficient than what 
`ForDeltaUtil` is doing for these cases, in the sense that `ForDeltaUtil` would 
actually fully encode the deltas and go through the whole dance of decoding 
them, etc.
   
   @rmuir / @jpountz Any concern with me creating a follow-on issue to further 
investigate and move forward with this PR in its current state?







[jira] [Commented] (LUCENE-9917) Reduce block size for BEST_COMPRESSION

2021-04-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317379#comment-17317379
 ] 

Robert Muir commented on LUCENE-9917:
-

do you mean BEST_SPEED here?

> Reduce block size for BEST_COMPRESSION
> --
>
> Key: LUCENE-9917
> URL: https://issues.apache.org/jira/browse/LUCENE-9917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> As benchmarks suggested major savings and minor slowdowns with larger block 
> sizes, I had increased them on LUCENE-9486. However it looks like this 
> slowdown is still problematic for some users, so I plan to go back to a 
> smaller block size, something like 10*16kB to get closer to the amount of 
> data we had to decompress per document when we had 16kB blocks without shared 
> dictionaries.






[GitHub] [lucene] rmuir commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


rmuir commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609952804



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   yes definitely followup or whatever. since you asked for suggestions i 
was just brainstorming... not necessary to be done here.







[jira] [Created] (LUCENE-9918) Can PForUtil be further auto-vectorized?

2021-04-08 Thread Greg Miller (Jira)
Greg Miller created LUCENE-9918:
---

 Summary: Can PForUtil be further auto-vectorized?
 Key: LUCENE-9918
 URL: https://issues.apache.org/jira/browse/LUCENE-9918
 Project: Lucene - Core
  Issue Type: Task
  Components: core/codecs
Affects Versions: main (9.0)
Reporter: Greg Miller


While working on LUCENE-9850, we discovered the loop in PForUtil::prefixSumOf 
is not getting auto-vectorized by the HotSpot compiler. We tried a few 
different tweaks to see if we could change this, but came up empty. There are 
some additional suggestions in the related 
[PR|https://github.com/apache/lucene/pull/69#discussion_r608412309] that could 
still be experimented with, and it may be worth doing so to see if further 
improvements could be squeezed out.






[GitHub] [lucene] gsmiller commented on a change in pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609959666



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
   in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
 }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+// This loop gets auto-vectorized
+for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+  longs[i] += base;
+}
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+  longs[i] = (i + 1) * val + base;

Review comment:
   Thanks @rmuir, I went ahead and created 
[LUCENE-9918](https://issues.apache.org/jira/browse/LUCENE-9918). I appreciate 
the additional suggestions! This stuff is super interesting and a bit out of my 
wheelhouse, so I love having more ideas to experiment with :)







[GitHub] [lucene] rmuir commented on pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


rmuir commented on pull request #73:
URL: https://github.com/apache/lucene/pull/73#issuecomment-816036750


   sure, please anyone push improvements, i just wanted to get it started.





[GitHub] [lucene] gsmiller commented on pull request #69: LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR)

2021-04-08 Thread GitBox


gsmiller commented on pull request #69:
URL: https://github.com/apache/lucene/pull/69#issuecomment-816040529


   @jpountz 
   
   > +1 Let go with whichever of `arr[i] = IDENTITY_PLUS_ONE[i] * val + base` 
or `arr[i] = (i+1) * val + base` runs fastest in your micro benchmark. We can 
still improve things later if we find a way to trick the JVM into 
auto-vectorizing this loop.
   
   Perfect, thanks! I'm changing this back to `(i + 1) * val + base` because it 
(somewhat surprisingly maybe, but I suppose this simple addition could be more 
efficient than an array reference) does consistently perform slightly better in 
microbenchmarks (`arrayRef == 0` is this implementation while `arrayRef == 1` 
references `IDENTITY_PLUS_ONE[i]`):
   ```
   Benchmark(arrayRef)  (bitsPerValue)  
(exceptionCount)  (sameVal)   Mode  Cnt  Score   Error   Units
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder   0   0  
   0  2  thrpt   20  7.915 Β± 0.008  ops/us
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder   1   0  
   0  2  thrpt   20  7.695 Β± 0.010  ops/us
   ```
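Separate from the throughput question, the two variants must of course produce identical outputs; a standalone sanity sketch (the constants here mirror `PForUtil`/`ForUtil` by hand and are assumptions of this sketch, not the real classes):

```java
// Sanity check that the two decode variants discussed above -- multiplying
// (i + 1) directly vs. reading it from an IDENTITY_PLUS_ONE lookup table --
// compute the same prefix sums.
public class PrefixSumVariants {
  static final int BLOCK_SIZE = 128;
  static final long[] IDENTITY_PLUS_ONE = new long[BLOCK_SIZE];
  static {
    for (int i = 0; i < BLOCK_SIZE; i++) IDENTITY_PLUS_ONE[i] = i + 1;
  }

  // Variant chosen in the PR: simple integer arithmetic per element.
  static void variantMultiply(long[] longs, long base, long val) {
    for (int i = 0; i < BLOCK_SIZE; i++) longs[i] = (i + 1) * val + base;
  }

  // Alternative: replace (i + 1) with a table lookup.
  static void variantArrayRef(long[] longs, long base, long val) {
    for (int i = 0; i < BLOCK_SIZE; i++) longs[i] = IDENTITY_PLUS_ONE[i] * val + base;
  }

  public static void main(String[] args) {
    long[] a = new long[BLOCK_SIZE];
    long[] b = new long[BLOCK_SIZE];
    variantMultiply(a, 1000, 7);
    variantArrayRef(b, 1000, 7);
    System.out.println(java.util.Arrays.equals(a, b));
    System.out.println(a[0] + " " + a[BLOCK_SIZE - 1]);
  }
}
```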





[GitHub] [lucene] dweiss commented on pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


dweiss commented on pull request #73:
URL: https://github.com/apache/lucene/pull/73#issuecomment-816110667


   I pushed a commit - sorry for being verbose. Hope this helps you (and 
others) understand how I think it should work. Not every task is incremental 
yet (and I didn't clarify that). 





[GitHub] [lucene] rmuir commented on pull request #73: LUCENE-9916: add a simple regeneration help doc

2021-04-08 Thread GitBox


rmuir commented on pull request #73:
URL: https://github.com/apache/lucene/pull/73#issuecomment-816171377


   super-helpful, thank you @dweiss ! 





[jira] [Commented] (LUCENE-9918) Can PForUtil be further auto-vectorized?

2021-04-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317473#comment-17317473
 ] 

Greg Miller commented on LUCENE-9918:
-

I've set up a microbenchmark project over 
[here|https://github.com/gsmiller/lucene-pfor-benchmark] to help explore this 
more easily if anyone is interested. I'll probably mess around with this a bit, 
but don't let that stop you from working on it if interested :)

> Can PForUtil be further auto-vectorized?
> 
>
> Key: LUCENE-9918
> URL: https://issues.apache.org/jira/browse/LUCENE-9918
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>
> While working on LUCENE-9850, we discovered the loop in PForUtil::prefixSumOf 
> is not getting auto-vectorized by the HotSpot compiler. We tried a few 
> different tweaks to see if we could change this, but came up empty. There are 
> some additional suggestions in the related 
> [PR|https://github.com/apache/lucene/pull/69#discussion_r608412309] that 
> could still be experimented with, and it may be worth doing so to see if 
> further improvements could be squeezed out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-9918) Can PForUtil be further auto-vectorized?

2021-04-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317493#comment-17317493
 ] 

Greg Miller commented on LUCENE-9918:
-

I think my (fairly naive) question here is mainly why the "multiplication loop" 
in the below code isn't able to get vectorized. Both the array copy and the 
"addition loop" are getting vectorized, but not the "multiplication loop." 
(I've put the decompiled assembly that I believe is relevant in the README in 
the above-referenced benchmark project).
{code:java}
protected void prefixSumOf(long[] longs, long base, long val) {
  System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
  for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
    longs[i] *= val;
  }
  for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
    longs[i] += base;
  }
}
{code}
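For anyone who wants to experiment outside the Lucene tree, the pattern above can be reproduced in a minimal, self-contained sketch. Note this is an illustration, not the actual Lucene class: {{BLOCK_SIZE}} and {{IDENTITY_PLUS_ONE}} are recreated here to mirror what ForUtil/PForUtil define ({{IDENTITY_PLUS_ONE[i] == i + 1}}), so the net effect is {{longs[i] = (i + 1) * val + base}}.

```java
public class PrefixSumSketch {
  // Mirrors ForUtil.BLOCK_SIZE in Lucene's postings code (assumption: 128).
  static final int BLOCK_SIZE = 128;

  // Recreation of PForUtil's IDENTITY_PLUS_ONE: IDENTITY_PLUS_ONE[i] == i + 1.
  static final long[] IDENTITY_PLUS_ONE = new long[BLOCK_SIZE];
  static {
    for (int i = 0; i < BLOCK_SIZE; ++i) {
      IDENTITY_PLUS_ONE[i] = i + 1;
    }
  }

  // Same shape as the snippet above: an array copy, then the
  // "multiplication loop" (reported not to vectorize), then the
  // "addition loop" (reported to vectorize).
  static void prefixSumOf(long[] longs, long base, long val) {
    System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, BLOCK_SIZE);
    for (int i = 0; i < BLOCK_SIZE; ++i) {
      longs[i] *= val;
    }
    for (int i = 0; i < BLOCK_SIZE; ++i) {
      longs[i] += base;
    }
  }

  public static void main(String[] args) {
    long[] longs = new long[BLOCK_SIZE];
    prefixSumOf(longs, 10, 3);
    // (i + 1) * 3 + 10 for i = 0, 1, and 127
    System.out.println(longs[0] + " " + longs[1] + " " + longs[BLOCK_SIZE - 1]);
    // prints "13 16 394"
  }
}
```

A standalone class like this is easy to feed to a JMH benchmark or to inspect with HotSpot's diagnostic flags when comparing how the two loops compile.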







[jira] [Commented] (LUCENE-9918) Can PForUtil be further auto-vectorized?

2021-04-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317494#comment-17317494
 ] 

Greg Miller commented on LUCENE-9918:
-

I'll also mention that [~rcmuir] has some thoughts over 
[here|https://github.com/apache/lucene/pull/69#discussion_r609750704] on some 
other ideas to try if anyone is interested in poking around more.







[GitHub] [lucene] jtibshirani opened a new pull request #74: LUCENE-9705: Correct the format names in Lucene90StoredFieldsFormat

2021-04-08 Thread GitBox


jtibshirani opened a new pull request #74:
URL: https://github.com/apache/lucene/pull/74


   We accidentally kept the old names when creating the new format.

