[jira] [Updated] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9201:

Attachment: javadocHTML5.png
javadocHTML4.png
javadocGRADLE.png

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032105#comment-17032105
 ] 

Robert Muir commented on LUCENE-9201:
-

I looked more into the package.html/overview.html issue. This seems to be purely a 
Gradle issue of not passing all the parameters to "javadocs".

I tested 3 cases: 
* HTML4 frames output, java 8
* HTML5 output, java 11
* gradle (HTML5) output, java 11

 !javadocGRADLE.png!  !javadocHTML4.png!  !javadocHTML5.png! 

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032105#comment-17032105
 ] 

Robert Muir edited comment on LUCENE-9201 at 2/7/20 4:51 AM:
-

I looked more into the package.html/overview.html issue. This seems to be purely a 
Gradle issue of not passing all the parameters to "javadocs".

I tested 3 cases: 
* HTML4 frames output, java 8
   !javadocHTML4.png!
* HTML5 output, java 11
   !javadocHTML5.png! 
* gradle (HTML5) output, java 11
   !javadocGRADLE.png!


was (Author: rcmuir):
I looked more into the package.html/overview.html issue. This seems to be purely a 
Gradle issue of not passing all the parameters to "javadocs".

I tested 3 cases: 
* HTML4 frames output, java 8
* HTML5 output, java 11
* gradle (HTML5) output, java 11

 !javadocGRADLE.png!  !javadocHTML4.png!  !javadocHTML5.png! 

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032109#comment-17032109
 ] 

Robert Muir commented on LUCENE-9201:
-

The missing overview.html is caused by a bug in the defaults-javadoc.gradle code:

{code}
  opts.overview = file("src/main/java/overview.html").toString()
{code}

It points to the wrong place for most Lucene modules, which use {{src/java}}, not 
{{src/main/java}}.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032110#comment-17032110
 ] 

Robert Muir commented on LUCENE-9201:
-

I will push the obvious fix to master, but it would be great to improve the 
Gradle code. I think we need an explicit file-exists check so that the build fails 
clearly if the overview.html is missing. It should be present for any of these 
artifacts.
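
For illustration, a minimal sketch of what such a check could look like in 
defaults-javadoc.gradle (the {{src/java}} path and the error message are 
assumptions, not the committed fix):

{code}
// Sketch: resolve overview.html against the module's actual source root and
// fail loudly if it is missing, instead of silently building without it.
def overviewFile = file("src/java/overview.html")
if (!overviewFile.exists()) {
  throw new GradleException("Missing ${overviewFile} for ${project.path}")
}
opts.overview = overviewFile.toString()
{code}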

 

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032111#comment-17032111
 ] 

ASF subversion and git services commented on LUCENE-9201:
-

Commit a77bb1e6f57ed21d484c3927d710679166918878 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a77bb1e ]

LUCENE-9201: add overview.html from correct location to the javadocs in gradle 
build


> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Commented] (LUCENE-9210) gradle javadocs doesn't incorporate CSS/JS

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032115#comment-17032115
 ] 

Robert Muir commented on LUCENE-9210:
-

The syntax highlighting sure makes the code snippets easy on the eyes. The Ant 
build accomplishes this by concatenating additional CSS and JS code directly onto 
the output files. Maybe there is a less evil way?
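
Until a better option surfaces, a rough sketch of reproducing the Ant behaviour in 
Gradle (the prettify file names, the javadoc output file names, and the doLast 
wiring are assumptions, not a finished solution):

{code}
// Sketch: append our extra CSS/JS to the files generated by the javadoc tool,
// mirroring what the Ant build does by concatenation.
tasks.named("javadoc").configure {
  doLast {
    new File(destinationDir, "stylesheet.css").append(file("prettify.css").text)
    new File(destinationDir, "script.js").append(file("prettify.js").text)
  }
}
{code}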

> gradle javadocs doesn't incorporate CSS/JS
> --
>
> Key: LUCENE-9210
> URL: https://issues.apache.org/jira/browse/LUCENE-9210
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> We add some extra content to the javadoc CSS/JS:
> * Prettify.css/js (syntax highlighting)
> * a few styles to migrate table cellpadding: LUCENE-9209
> The Ant task concatenates this content to the end of the generated javadoc 
> CSS/JS.
> We should either do the same in the Gradle build or remove our reliance on 
> this.






[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032119#comment-17032119
 ] 

Robert Muir commented on LUCENE-9201:
-

FYI, I also discovered LUCENE-9210 as part of the investigation here; it is 
another TODO. With Ant, the additional custom JS/CSS syntax-highlights the 
sample code, and some CSS classes are used for the HTML5 transition. It only 
affects presentation, so it doesn't cause failures, but it would be good to fix 
Gradle to use these.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Ant build's "documentation-lint" target consists of these two sub-targets:
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (missing javadocs / broken links check by Python scripts)






[jira] [Commented] (SOLR-14066) Deprecate DIH

2020-02-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031385#comment-17031385
 ] 

Jan Høydahl commented on SOLR-14066:


[~rohitcse] are you still willing to maintain DIH? What support do you need 
from the community to get started?

> Deprecate DIH
> -
>
> Key: SOLR-14066
> URL: https://issues.apache.org/jira/browse/SOLR-14066
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
> Attachments: image-2019-12-14-19-58-39-314.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> DataImportHandler has outlived its utility. DIH doesn't need to remain inside 
> Solr anymore. The plan is to deprecate DIH in 8.5, remove it in 9.0, and work on 
> handing it off to volunteers in the community (so far, [~rohitcse] has 
> volunteered to maintain it).






[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031412#comment-17031412
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit fdf5ade727ea8a5a6232d421a33b3fa1495d93b3 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fdf5ade ]

LUCENE-9147: Fix codec excludes.


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?






[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031411#comment-17031411
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit 3246b2605869549dfbcedef21ea24d7101c20eee in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3246b26 ]

LUCENE-9147: Fix codec excludes.


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?






[jira] [Resolved] (LUCENE-9147) Move the stored fields index off-heap

2020-02-06 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9147.
--
Fix Version/s: 8.5
   Resolution: Fixed

> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?






[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031461#comment-17031461
 ] 

Xin-Chun Zhang commented on LUCENE-9004:


??You don't share your test code, but I suspect you open new IndexReader every 
time you issue a query???

[~tomoko] The test code can be found in 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java].
Yes, I opened a new reader for each query in the hope that IVFFlat and HNSW would be 
compared under fair conditions, since IVFFlat does not have a cache. I now realize this 
may lead to OOM, so I replaced it with a shared IndexReader and the problem was 
resolved.
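
For anyone reproducing the numbers below, a minimal sketch of the shared-reader 
pattern (the index path, query list, and surrounding harness are placeholders, not 
the actual tester code):

{code}
// Sketch: open the IndexReader once and reuse it for every query, instead of a
// new reader per query (which duplicates heap usage and can lead to OOM).
void runQueries(Path indexPath, List<Query> queries) throws IOException {
  try (Directory dir = FSDirectory.open(indexPath);
       IndexReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    for (Query query : queries) {
      TopDocs hits = searcher.search(query, 1); // top-1 for the recall measurement
      // ... record latency and whether the expected vector was returned
    }
  }
}
{code}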

 

Update -- Top 1 in-set (query vector is in the candidate data set) recall 
results on SIFT1M data set ([http://corpus-texmex.irisa.fr/]) of IVFFlat and 
HNSW are as follows,

IVFFlat (no cache, reuse IndexReader)

 
||nprobe||avg. search time (ms)||recall percent (%)||
|8|13.3165|64.8|
|16|13.968|79.65|
|32|16.951|89.3|
|64|21.631|95.6|
|128|31.633|98.8|

 

HNSW (static cache, reuse IndexReader)
||avg. search time (ms)||recall percent (%)||
|6.3|20.45|

It can readily be shown that HNSW performs much better in query time. But I was 
surprised that the top-1 in-set recall of HNSW is so low. It shouldn't be a 
problem of the algorithm itself, but more likely a problem in the implementation or 
test code. I will check it this weekend.

 

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from s

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-02-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028283#comment-17028283
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 2/6/20 10:52 AM:
-

 ??"Is it making life difficult to keep them separate?"??

[~sokolov] No, we can keep them separate at present. I have merged your 
[branch|[https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]] 
into my personal 
[GitHub branch|[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]] in 
order to do the comparison between IVFFlat and HNSW, and I reused some work 
that [~tomoko] and you did. Code refactoring will be required before we 
commit.

 

??"Have you tried comparing them on real data?"??

[~yurymalkov], [~mikemccand] Thanks for your advice. I haven't done it yet, but 
will do it soon.

 

*Update – Feb. 4, 2020*

I have added two performance test tools 
(KnnIvfPerformTester/KnnIvfAndGraphPerformTester) to my personal branch, and the 
SIFT1M dataset (1,000,000 base vectors with 128 dimensions, 
[http://corpus-texmex.irisa.fr/]) is used for the test. Top-1 recall 
performance of IVFFlat is as follows (*a new IndexReader was opened for each 
query*):

centroids=707
||nprobe||avg. search time (ms)||recall percent (%)||
|8|71.314|69.15|
|16|121.7565|82.3|
|32|155.692|92.85|
|64|159.3655|98.7|
|128|217.5205|99.9|

centroids=4000
||nprobe||avg. search time (ms)||recall percent (%)||
|8|56.3745|65.35|
|16|59.5435|78.85|
|32|71.751|89.85|
|64|90.396|96.25|
|128|135.3805|99.3|

Unfortunately, I couldn't obtain the corresponding results of HNSW due to the 
out of memory error in my PC. A special case with 2,000 base vectors 
demonstrates that IVFFlat is faster and more accurate. HNSW may outperform 
IVFFlat on larger data sets when larger memory is available, as shown in 
[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].


was (Author: irvingzhang):
 ??"Is it making life difficult to keep them separate?"??

[~sokolov] No, we can keep them separate at present. I have merged your 
[branch|[https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]] 
into my personal 
[GitHub branch|[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]] in 
order to do the comparison between IVFFlat and HNSW, and I reused some work 
that [~tomoko] and you did. Code refactoring will be required before we 
commit.

 

??"Have you tried comparing them on real data?"??

[~yurymalkov], [~mikemccand] Thanks for your advice. I haven't done it yet, but 
will do it soon.

 

*Update – Feb. 4, 2020*

I have added two performance test tools 
(KnnIvfPerformTester/KnnIvfAndGraphPerformTester) to my personal branch, and the 
SIFT1M dataset (1,000,000 base vectors with 128 dimensions, 
[http://corpus-texmex.irisa.fr/]) is used for the test. Top-1 recall 
performance of IVFFlat is as follows:

centroids=707
||nprobe||avg. search time (ms)||recall percent (%)||
|8|71.314|69.15|
|16|121.7565|82.3|
|32|155.692|92.85|
|64|159.3655|98.7|
|128|217.5205|99.9|

centroids=4000
||nprobe||avg. search time (ms)||recall percent (%)||
|8|56.3745|65.35|
|16|59.5435|78.85|
|32|71.751|89.85|
|64|90.396|96.25|
|128|135.3805|99.3|

Unfortunately, I couldn't obtain the corresponding results of HNSW due to the 
out of memory error in my PC. A special case with 2,000 base vectors 
demonstrates that IVFFlat is faster and more accurate. HNSW may outperform 
IVFFlat on larger data sets when larger memory is available, as shown in 
[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume yo

[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-582866319
 
 
   >And how can indexing and searching get so much faster when 
compress/decompress is in the path!
   
   I tried benchmarking some straight-forward file read and write operations 
(no Lucene) and couldn't show LZ4 compression being faster (although it wasn't 
that much slower).
   
   Maybe the rate-limited merging in Lucene plays a part and size therefore 
matters in that context?
   





[jira] [Updated] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-02-06 Thread Andrzej Wislowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Wislowski updated SOLR-14194:
-
Attachment: SOLR-14194.patch

> Allow Highlighting to work for indexes with uniqueKey that is not stored
> 
>
> Key: SOLR-14194
> URL: https://issues.apache.org/jira/browse/SOLR-14194
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Affects Versions: master (9.0)
>Reporter: Andrzej Wislowski
>Assignee: David Smiley
>Priority: Minor
>  Labels: highlighter
> Attachments: SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch, 
> SOLR-14194.patch
>
>
> Highlighting requires the uniqueKey to be a stored field. I have changed the 
> Highlighter to allow returning results on indexes whose uniqueKey is not a 
> stored field but is saved as a docValues type.
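
(For context, a minimal sketch of the general approach of resolving the uniqueKey 
from docValues when it is not stored; the field name, reader, and docId variables 
are placeholders, not the attached patch.)

{code}
// Sketch: look the uniqueKey up via SortedDocValues instead of a stored field.
SortedDocValues ids = DocValues.getSorted(leafReader, uniqueKeyField);
if (ids.advanceExact(docId)) {
  BytesRef uniqueKey = ids.lookupOrd(ids.ordValue());
  // ... use uniqueKey to key this document's highlighting snippets
}
{code}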






[jira] [Commented] (LUCENE-9207) Don't build SpanQuery in QueryBuilder

2020-02-06 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031526#comment-17031526
 ] 

Jim Ferenczi commented on LUCENE-9207:
--

+1, I agree that the current optimization can help in some cases, but the crazy 
expansion that can arise on phrase queries of shingles of different sizes 
should be considered a bug. We already disable graph queries in Elasticsearch 
if the analyzer contains a filter that is known to produce paths that don't 
align (shingles of different sizes in the same field), so we could probably add 
the same mechanism in Solr. I am also less worried by this issue now that we 
eagerly check the number of paths while building (and throw a max-boolean-clauses 
error if the number of paths is above the max boolean clause limit).

> Don't build SpanQuery in QueryBuilder
> -
>
> Key: LUCENE-9207
> URL: https://issues.apache.org/jira/browse/LUCENE-9207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Subtask of LUCENE-9204.  QueryBuilder currently has special logic for graph 
> phrase queries with no slop, constructing a spanquery that attempts to follow 
> all paths using a combination of OR and NEAR queries.  Given the known bugs 
> in this type of query (LUCENE-7398) and that we would like to move span 
> queries out of core in any case, we should remove this logic and just build a 
> disjunction of phrase queries, one phrase per path.
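
(Illustrative only: a rough sketch of the proposed "one phrase per path" 
disjunction, assuming the token paths have already been enumerated; this is not 
the actual QueryBuilder change.)

{code}
// Sketch: OR together one PhraseQuery per enumerated path through the graph.
BooleanQuery.Builder disjunction = new BooleanQuery.Builder();
for (List<String> path : paths) {   // 'paths' is a placeholder for the enumerated token paths
  PhraseQuery.Builder phrase = new PhraseQuery.Builder();
  int position = 0;
  for (String term : path) {
    phrase.add(new Term(field, term), position++);
  }
  disjunction.add(phrase.build(), BooleanClause.Occur.SHOULD);
}
Query query = disjunction.build();
{code}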






[jira] [Commented] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-02-06 Thread Andrzej Wislowski (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031528#comment-17031528
 ] 

Andrzej Wislowski commented on SOLR-14194:
--

[~dsmiley] I have added updated patch with test fix

> Allow Highlighting to work for indexes with uniqueKey that is not stored
> 
>
> Key: SOLR-14194
> URL: https://issues.apache.org/jira/browse/SOLR-14194
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Affects Versions: master (9.0)
>Reporter: Andrzej Wislowski
>Assignee: David Smiley
>Priority: Minor
>  Labels: highlighter
> Attachments: SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch, 
> SOLR-14194.patch
>
>
> Highlighting requires the uniqueKey to be a stored field. I have changed the 
> Highlighter to allow returning results on indexes whose uniqueKey is not a 
> stored field but is saved as a docValues type.






[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375252133
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -61,11 +66,13 @@
 
   IndexOutput data, meta;
   final int maxDoc;
+  private SegmentWriteState state;
 
 Review comment:
   make it final?





[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375278323
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+data.writeVInt(numDocsInCurrentBlock);
+for (int i = 0; i < numDocsInCurrentBlock; i++) {
+  data.writeVInt(docLengths[i]);
+}
+maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, 
uncompressedBlockLength);
+LZ4.compress(block,  0, uncompressedBlockLength, data, ht);
+numDocsInCurrentBlock = 0;
+uncompressedBlockLength = 0;
+maxPointer = data.getFilePointer();
+tempBinaryOffsets.writeVLong(maxPointer - thisBlockStartPointer);
+  }
+}
+
+void writeMetaData() throws IOException {
+  if (blo

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375274563
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 
 Review comment:
   we usually do it like this instead, which helps avoid catching Throwable:
   
   ```
   boolean success = false;
   try {
     // write header
     success = true;
   } finally {
     if (success == false) {
       // close
     }
   }
   ```





[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375273736
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter  implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void  addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if(numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+data.writeVInt(numDocsInCurrentBlock);
+for (int i = 0; i < numDocsInCurrentBlock; i++) {
 
 Review comment:
   +1



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375252907
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
 
 Review comment:
   ```suggestion
   byte[] block = new byte [1024 * 16];
   ```





[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375275497
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
 
 Review comment:
   it looks like we could set `blockAddressesStart` in the constructor instead?





[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375252836
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
 
 Review comment:
   we usually don't leave a space between the array element type and `[]`
   
   ```suggestion
   int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
   ```





[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375277898
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+data.writeVInt(numDocsInCurrentBlock);
+for (int i = 0; i < numDocsInCurrentBlock; i++) {
+  data.writeVInt(docLengths[i]);
+}
+maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, 
uncompressedBlockLength);
+LZ4.compress(block,  0, uncompressedBlockLength, data, ht);
 
 Review comment:
   ```suggestion
   LZ4.compress(block, 0, uncompressedBlockLength, data, ht);
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375827346
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +755,107 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private int []uncompressedDocEnds = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
+private int uncompressedBlockLength = 0;
+private int numDocsInBlock = 0;
+private final byte[] uncompressedBlock;
+private BytesRef uncompressedBytesRef;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  
 
 Review comment:
   we could initialize uncompressedBytesRef from the uncompressed block:
   `uncompressedBytesRef = new BytesRef(uncompressedBlock)`
   and avoid creating new BytesRefs over and over in `decode`
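   For illustration, a minimal sketch of that reuse pattern, using a hypothetical `ReusableBlockView` helper (not part of the patch) to show the idea of allocating the `BytesRef` once and only adjusting its offset/length per document:
   ```java
   import org.apache.lucene.util.BytesRef;

   // Hypothetical helper, for illustration only: the backing array and the BytesRef
   // are allocated once; callers decompress into buffer() and then read values
   // through slice(), which only moves offset/length instead of allocating.
   class ReusableBlockView {
     private final byte[] uncompressedBlock;
     private final BytesRef view;

     ReusableBlockView(int biggestUncompressedBlockSize) {
       this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
       this.view = new BytesRef(uncompressedBlock); // offset = 0, length = array length
     }

     byte[] buffer() {
       return uncompressedBlock; // target buffer for decompression
     }

     BytesRef slice(int start, int end) {
       view.offset = start;
       view.length = end - start;
       return view;
     }
   }
   ```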


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-02-06 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031583#comment-17031583
 ] 

Lucene/Solr QA commented on SOLR-14194:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m  7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | 
{color:green}  1m  3s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 51m 37s{color} 
| {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 56m 15s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.search.TestTermsQParserPlugin |
|   | solr.TestDistributedGrouping |
|   | solr.cloud.HttpPartitionWithTlogReplicasTest |
|   | solr.handler.component.DistributedSpellCheckComponentTest |
|   | solr.analysis.PathHierarchyTokenizerFactoryTest |
|   | solr.update.processor.AtomicUpdatesTest |
|   | solr.update.PeerSyncTest |
|   | solr.search.stats.TestLRUStatsCache |
|   | solr.TestHighlightDedupGrouping |
|   | solr.handler.component.DistributedDebugComponentTest |
|   | solr.DisMaxRequestHandlerTest |
|   | solr.update.PeerSyncWithBufferUpdatesTest |
|   | solr.handler.component.DistributedFacetPivotSmallTest |
|   | solr.cloud.BasicZkTest |
|   | solr.search.function.TestSortByMinMaxFunction |
|   | solr.search.stats.TestExactStatsCache |
|   | solr.update.PeerSyncWithLeaderTest |
|   | solr.search.facet.DistributedFacetSimpleRefinementLongTailTest |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-14194 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12992768/SOLR-14194.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  validaterefguide  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / fdf5ade727e |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
| unit | 
https://builds.apache.org/job/PreCommit-SOLR-Build/680/artifact/out/patch-unit-solr_core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/680/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/680/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> Allow Highlighting to work for indexes with uniqueKey that is not stored
> 
>
> Key: SOLR-14194
> URL: https://issues.apache.org/jira/browse/SOLR-14194
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Affects Versions: master (9.0)
>Reporter: Andrzej Wislowski
>Assignee: David Smiley
>Priority: Minor
>  Labels: highlighter
> Attachments: SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch, 
> SOLR-14194.patch
>
>
> Highlighting requires uniqueKey to be a stored field. I have changed 
> Highlighter allow returning results on indexes with uniqueKey that is a not 
> stored field, but saved as a docvalue type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] romseygeek merged pull request #1097: LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals

2020-02-06 Thread GitBox
romseygeek merged pull request #1097: LUCENE-9099: Correctly handle repeats in 
ORDERED and UNORDERED intervals
URL: https://github.com/apache/lucene-solr/pull/1097
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031632#comment-17031632
 ] 

ASF subversion and git services commented on LUCENE-9099:
-

Commit 7c1ba1aebeea540b67ae304deee60162baee2e12 in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c1ba1a ]

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097)

If you have repeating intervals in an ordered or unordered interval source, you 
currently 
get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just a b if it first 
matches a a b, meaning
that you can get incorrect results if used in a `CONTAINING` filter - 
`CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a 
b y`
* `UNORDERED(a, a)` will match on documents that just contain a single a.

This commit adds a RepeatingIntervalsSource that correctly handles repeats 
within 
ordered and unordered sources. It also changes the way that gaps are calculated 
within 
ordered and unordered sources, by using a new width() method on 
IntervalIterator. The 
default implementation just returns end() - start() + 1, but 
RepeatingIntervalsSource 
instead returns the sum of the widths of its child iterators. This preserves 
maxgaps filtering 
on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches 
now always 
returns an explicit IntervalsMatchesIterator rather than a plain 
MatchesIterator, which adds 
gaps() and width() methods so that submatches can be combined in the same way 
that 
subiterators are. Extra checks have been added to checkIntervals() to ensure 
that the same 
intervals are returned by both iterator and matches, and a fix to 
DisjunctionIntervalIterator#matches() is also included - 
DisjunctionIntervalIterator minimizes 
its intervals, while MatchesUtils.disjunction does not, so there was a 
discrepancy between 
the two methods.
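
A rough illustration of the case described above, assuming the Intervals factory methods (the exact module/package for the intervals API varies across releases); this is a sketch, not code from the commit:

```java
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;

class RepeatsExample {
  // For a document "a x a b y": before this change, ORDERED(a, a, b) could also
  // emit a narrower interval over just "a b", so the CONTAINING filter below
  // matched; RepeatingIntervalsSource handles the repeated "a" terms correctly.
  static IntervalsSource confusingCase() {
    IntervalsSource repeats = Intervals.ordered(
        Intervals.term("a"), Intervals.term("a"), Intervals.term("b"));
    return Intervals.containing(
        Intervals.ordered(Intervals.term("x"), Intervals.term("y")), repeats);
  }
}
```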


> Correctly handle repeats in ordered and unordered intervals
> ---
>
> Key: LUCENE-9099
> URL: https://issues.apache.org/jira/browse/LUCENE-9099
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If you have repeating intervals in an ordered or unordered interval source, 
> you currently get somewhat confusing behaviour:
> * ORDERED(a, a, b) will return an extra interval over just `a b` if it first 
> matches `a a b`, meaning that you can get incorrect results if used in a 
> CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on 
> the document `a x a b y`
> * UNORDERED(a, a) will match on documents that just contain a single `a`.
> It is possible to deal with the unordered case when building sources by 
> rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, 
> b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks 
> MAXGAPS filtering.
> We should try and fix this within intervals themselves.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031648#comment-17031648
 ] 

ASF subversion and git services commented on LUCENE-9099:
-

Commit aa916bac3c3369a461afa06e384e070657c32973 in lucene-solr's branch 
refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=aa916ba ]

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097)

If you have repeating intervals in an ordered or unordered interval source, you 
currently 
get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just a b if it first 
matches a a b, meaning
that you can get incorrect results if used in a `CONTAINING` filter - 
`CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a 
b y`
* `UNORDERED(a, a)` will match on documents that just contain a single a.

This commit adds a RepeatingIntervalsSource that correctly handles repeats 
within 
ordered and unordered sources. It also changes the way that gaps are calculated 
within 
ordered and unordered sources, by using a new width() method on 
IntervalIterator. The 
default implementation just returns end() - start() + 1, but 
RepeatingIntervalsSource 
instead returns the sum of the widths of its child iterators. This preserves 
maxgaps filtering 
on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches 
now always 
returns an explicit IntervalsMatchesIterator rather than a plain 
MatchesIterator, which adds 
gaps() and width() methods so that submatches can be combined in the same way 
that 
subiterators are. Extra checks have been added to checkIntervals() to ensure 
that the same 
intervals are returned by both iterator and matches, and a fix to 
DisjunctionIntervalIterator#matches() is also included - 
DisjunctionIntervalIterator minimizes 
its intervals, while MatchesUtils.disjunction does not, so there was a 
discrepancy between 
the two methods.


> Correctly handle repeats in ordered and unordered intervals
> ---
>
> Key: LUCENE-9099
> URL: https://issues.apache.org/jira/browse/LUCENE-9099
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If you have repeating intervals in an ordered or unordered interval source, 
> you currently get somewhat confusing behaviour:
> * ORDERED(a, a, b) will return an extra interval over just `a b` if it first 
> matches `a a b`, meaning that you can get incorrect results if used in a 
> CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on 
> the document `a x a b y`
> * UNORDERED(a, a) will match on documents that just contain a single `a`.
> It is possible to deal with the unordered case when building sources by 
> rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, 
> b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks 
> MAXGAPS filtering.
> We should try and fix this within intervals themselves.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals

2020-02-06 Thread Alan Woodward (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Woodward resolved LUCENE-9099.
---
Fix Version/s: 8.5
   Resolution: Fixed

> Correctly handle repeats in ordered and unordered intervals
> ---
>
> Key: LUCENE-9099
> URL: https://issues.apache.org/jira/browse/LUCENE-9099
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If you have repeating intervals in an ordered or unordered interval source, 
> you currently get somewhat confusing behaviour:
> * ORDERED(a, a, b) will return an extra interval over just `a b` if it first 
> matches `a a b`, meaning that you can get incorrect results if used in a 
> CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on 
> the document `a x a b y`
> * UNORDERED(a, a) will match on documents that just contain a single `a`.
> It is possible to deal with the unordered case when building sources by 
> rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, 
> b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks 
> MAXGAPS filtering.
> We should try and fix this within intervals themselves.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
markharwood commented on a change in pull request #1234: Add compression for 
Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375903347
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
 
 Review comment:
   I tried that and it didn't work - something else was writing to data in 
between constructor and addDoc calls


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mocobeta opened a new pull request #1242: LUCENE-9201: Port documentation-lint task to Gradle build

2020-02-06 Thread GitBox
mocobeta opened a new pull request #1242: LUCENE-9201: Port documentation-lint 
task to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1242
 
 
   # Description
   
   This PR adds an equivalent of "documentation-lint" to the Gradle build.
   
   # Solution
   
   The `gradle/validation/documentation-lint.gradle` includes 
   - `documentationLint` task that is supposed to be called from the `precommit` task,
   - a root project level sub-task `checkBrokenLinks`, 
   - sub-project level sub-tasks `ecjJavadocLint`, `checkMissingJavadocsClass`, 
and `checkMissingJavadocsMethod`.
   
   # Note
   
   For now, the Python linters - `checkBrokenLinks`, `checkMissingJavadocsClass` and `checkMissingJavadocsMethod` - will fail because the Gradle-generated Javadocs seem to be slightly different from the Ant-generated ones.
   e.g.:
   - Javadoc directory structure: "ant documentation" generates an "analyzers-common" docs dir for the "analysis/common" module, but "gradlew javadoc" generates "analysis/common" for the same module. I think we can adjust the structure, but where is the suitable place to do so?
   - Package summary: "ant documentation" uses "package.html" as the package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lack a summary description in "package-summary.html" when building javadocs with Gradle). We might be able to make the Gradle Javadoc task handle "package.html" files properly with some options. Or, should we replace all "package.html" with "package-info.java" at this time?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9199) can't build javadocs on java 13+

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031682#comment-17031682
 ] 

ASF subversion and git services commented on LUCENE-9199:
-

Commit 7f4560c59a71f271058f13b3b30901ca8c233022 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7f4560c ]

LUCENE-9199: allow building javadocs on java 13+


> can't build javadocs on java 13+
> 
>
> Key: LUCENE-9199
> URL: https://issues.apache.org/jira/browse/LUCENE-9199
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9199.patch
>
>
> The build tries to pass an option (--no-module-directories) that is no longer 
> valid: https://bugs.openjdk.java.net/browse/JDK-8215582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9199) can't build javadocs on java 13+

2020-02-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9199:

Affects Version/s: master (9.0)

> can't build javadocs on java 13+
> 
>
> Key: LUCENE-9199
> URL: https://issues.apache.org/jira/browse/LUCENE-9199
> Project: Lucene - Core
>  Issue Type: Task
>Affects Versions: master (9.0)
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9199.patch
>
>
> The build tries to pass an option (--no-module-directories) that is no longer 
> valid: https://bugs.openjdk.java.net/browse/JDK-8215582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9199) can't build javadocs on java 13+

2020-02-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9199:

Fix Version/s: master (9.0)

> can't build javadocs on java 13+
> 
>
> Key: LUCENE-9199
> URL: https://issues.apache.org/jira/browse/LUCENE-9199
> Project: Lucene - Core
>  Issue Type: Task
>Affects Versions: master (9.0)
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9199.patch
>
>
> The build tries to pass an option (--no-module-directories) that is no longer 
> valid: https://bugs.openjdk.java.net/browse/JDK-8215582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9199) can't build javadocs on java 13+

2020-02-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9199.
-
Resolution: Fixed

The problem only impacted master branch, not branch_8x.

> can't build javadocs on java 13+
> 
>
> Key: LUCENE-9199
> URL: https://issues.apache.org/jira/browse/LUCENE-9199
> Project: Lucene - Core
>  Issue Type: Task
>Affects Versions: master (9.0)
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9199.patch
>
>
> The build tries to pass an option (--no-module-directories) that is no longer 
> valid: https://bugs.openjdk.java.net/browse/JDK-8215582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
markharwood commented on a change in pull request #1234: Add compression for 
Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375914836
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+data.writeVInt(numDocsInCurrentBlock);
+for (int i = 0; i < numDocsInCurrentBlock; i++) {
+  data.writeVInt(docLengths[i]);
+}
+maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, 
uncompressedBlockLength);
+LZ4.compress(block,  0, uncompressedBlockLength, data, ht);
+numDocsInCurrentBlock = 0;
+uncompressedBlockLength = 0;
+maxPointer = data.getFilePointer();
+tempBinaryOffsets.writeVLong(maxPointer - thisBlockStartPointer);
+  }
+}
+
+void writeMetaData() throws IOException {
+  if 

[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
markharwood commented on a change in pull request #1234: Add compression for 
Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375922373
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 
 Review comment:
   What was the "+1" comment for line 407 about?
   I've seen encodings elsewhere that have n+1 offsets recording the start of each value, where the last offset is effectively the end of the last value. In this scenario I'm writing n value lengths.
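   A tiny sketch of the two layouts being contrasted (illustrative only; `BlockHeaderSketch` and its methods are hypothetical, not the codec's actual format):
   ```java
   import java.io.IOException;
   import org.apache.lucene.store.DataOutput;

   class BlockHeaderSketch {
     // Layout used in this patch: write n per-document lengths.
     static void writeLengths(DataOutput out, int[] docLengths, int n) throws IOException {
       for (int i = 0; i < n; i++) {
         out.writeVInt(docLengths[i]);
       }
     }

     // Alternative layout: n+1 offsets, where the last offset marks the end of the
     // last value, so a value's length is offsets[i+1] - offsets[i].
     static void writeOffsets(DataOutput out, int[] docLengths, int n) throws IOException {
       int offset = 0;
       out.writeVInt(offset);
       for (int i = 0; i < n; i++) {
         offset += docLengths[i];
         out.writeVInt(offset);
       }
     }
   }
   ```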


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031702#comment-17031702
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

[~erickerickson] I added sub-tasks equivalent to the ant targets.
 - -check-broken-links (this internally calls 
{{dev-tools/scripts/checkJavadocLinks.py}})
 - -check-missing-javadocs (this internally calls 
{{dev-tools/scripts/checkJavaDocs.py}} )

And I opened a PR :)

[https://github.com/apache/lucene-solr/pull/1242]

I think this is almost equivalent to Ant's "documentation-lint", with some 
notes below. [~erickerickson] [~dweiss] Could you review it?

*Note:*

For now, the Python linters - {{checkBrokenLinks}}, {{checkMissingJavadocsClass}} and {{checkMissingJavadocsMethod}} - will fail because the Gradle-generated Javadocs seem to be slightly different from the Ant-generated ones.
 * Javadoc directory structure: "ant documentation" generates an "analyzers-common" docs dir for the "analysis/common" module, but "gradlew javadoc" generates "analysis/common" for the same module. I think we can adjust the structure, but where is the suitable place to do so?
 * Package summary: "ant documentation" uses "package.html" as the package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lack a summary description in "package-summary.html" when building javadocs with Gradle). We might be able to make the Gradle Javadoc task handle "package.html" files properly with some options. Or, should we replace all "package.html" with "package-info.java" at this time? A minimal example of such a conversion is sketched below.

After the Gradle-generated Javadoc is fixed, we can return here and complete this sub-task.
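
For reference, a minimal package-info.java equivalent of a package.html summary (the package name below is hypothetical):

```java
/**
 * Shared analysis components for the example module.
 * <p>
 * The first sentence becomes the package summary shown in package-summary.html,
 * much as the body of the old package.html did.
 */
package org.apache.lucene.analysis.example; // hypothetical package name
```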

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375927967
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 
 Review comment:
   It was about optimizing for the case where all values have the same length. In that case we could still use one bit of the first length to mean that all values have the same length, for instance?
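   One way such a flag could be encoded, purely as an illustration (the `SameLengthFlagSketch` class and layout are hypothetical, not what the codec writes):
   ```java
   import java.io.IOException;
   import org.apache.lucene.store.DataOutput;

   class SameLengthFlagSketch {
     // Illustrative only: use the low bit of the first written value as an
     // "all lengths equal" flag, so equal-length blocks store a single length.
     static void writeDocLengths(DataOutput out, int[] docLengths, int n) throws IOException {
       boolean allEqual = true;
       for (int i = 1; i < n; i++) {
         if (docLengths[i] != docLengths[0]) {
           allEqual = false;
           break;
         }
       }
       if (allEqual) {
         out.writeVInt((docLengths[0] << 1) | 1); // flag set: one length for the whole block
       } else {
         out.writeVInt(docLengths[0] << 1);        // flag clear: n lengths follow
         for (int i = 1; i < n; i++) {
           out.writeVInt(docLengths[i]);
         }
       }
     }
   }
   ```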


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031712#comment-17031712
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit 85dba7356f32da6d577550a6dd6c5e6244556d87 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=85dba73 ]

LUCENE-9147: Make sure temporary files get deleted on all code paths.


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031711#comment-17031711
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit 6a380798a27e1ce777843a4322afba463e383acc in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6a38079 ]

LUCENE-9147: Make sure temporary files get deleted on all code paths.


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9077) Gradle build

2020-02-06 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9077:
--
Attachment: LUCENE-9077-javadoc-locale-en-US.patch

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9077-javadoc-locale-en-US.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
>  * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  * There is some python execution in check-broken-links and 
> check-missing-javadocs, not sure if it's been ported
>  * Nightly-smoke also have some python execution, not sure of the status.
>  * Precommit doesn't catch unused imports
>  
> *{color:#ff}Note:{color}* this builds on the work 

[jira] [Created] (SOLR-14246) Can't create core when server/solr has a file whose name starts with same string as core

2020-02-06 Thread arnoldbird (Jira)
arnoldbird created SOLR-14246:
-

 Summary: Can't create core when server/solr has a file whose name 
starts with same string as core
 Key: SOLR-14246
 URL: https://issues.apache.org/jira/browse/SOLR-14246
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 6.6.6
 Environment: Centos 7
Reporter: arnoldbird


If server/solr contains a file named...

something-archive.tar.gz

...and you try to create a core named "something," the response is...

{{ERROR:Core 'something' already exists!}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031718#comment-17031718
 ] 

Robert Muir commented on LUCENE-9201:
-

I think gradle passes slightly different options to the javadoc tool than ant does, which creates the problem. For example, the gradle build has only one "linkoffline" but the ant build has two. Such small differences could create broken links.


> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031732#comment-17031732
 ] 

Tomoko Uchida commented on LUCENE-9077:
---

I found a JDK Javadoc tool related issue which was fixed in the ant build in 
https://issues.apache.org/jira/browse/LUCENE-8738?focusedCommentId=16822659&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16822659.
 I attached the same workaround patch [^LUCENE-9077-javadoc-locale-en-US.patch] for the gradle build. Will commit it soon.

 

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9077-javadoc-locale-en-US.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
>  * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-

[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031739#comment-17031739
 ] 

ASF subversion and git services commented on LUCENE-9077:
-

Commit f3cd1dbde36d8fd85bd2e87dcfaffc8b03eec87c in lucene-solr's branch 
refs/heads/master from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f3cd1db ]

LUCENE-9077: Force locale en_US on Javadoc task (workaroud for JDK-8222793)


> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9077-javadoc-locale-en-US.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
>  * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  

[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-02-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031756#comment-17031756
 ] 

Munendra S N commented on SOLR-11725:
-

Thanks [~hossman] for the review. I will share an updated patch with a CHANGES 
entry, including upgrade notes.

{quote}Wait... the conversation from 2017 wasn't resolved? What do we want to do 
about stddev of singleton sets? Solr currently returns 0.0, and Hoss seemed to 
think this was the right behavior. But the patch here would seem to change the 
behavior to return NaN (but I didn't test it...). After a quick glance, it 
doesn't look like existing tests cover this case either?{quote}
Thanks [~ysee...@gmail.com] for the review. There are no tests covering the 
singleton case, and I'm not sure whether a sample size of 0 is covered either.
I think changing the current behavior for the singleton case should be taken up 
in a separate issue, as it concerns both the classical stats component and the 
JSON aggregations. This patch doesn't change the current behavior; I will add 
tests to cover these cases.
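
For concreteness, here is a small standalone sketch (not Solr code) of the two 
formulas quoted in the issue description below, including the singleton-set 
behavior discussed above (uncorrected yields 0.0, corrected yields NaN from the 
0/0 division):

{code:java}
// Sketch only: the two formulas this issue compares, written against the same
// accumulators (count, sum, sumSq) that the Solr aggregations keep.
public class StddevFormulasSketch {
  // "Uncorrected sample stddev" -- the formula currently in StddevAgg.java
  static double uncorrected(long count, double sum, double sumSq) {
    return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
  }

  // "Corrected sample stddev" -- the formula in StatsValuesFactory.java
  static double corrected(long count, double sum, double sumSq) {
    return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
  }

  public static void main(String[] args) {
    double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
    long count = 0;
    double sum = 0, sumSq = 0;
    for (double v : values) { count++; sum += v; sumSq += v * v; }
    System.out.println(uncorrected(count, sum, sumSq)); // 2.0
    System.out.println(corrected(count, sum, sumSq));   // ~2.138 (divides by n-1)

    // singleton set: uncorrected returns 0.0, corrected returns NaN (0.0/0.0)
    System.out.println(uncorrected(1, 5.0, 25.0)); // 0.0
    System.out.println(corrected(1, 5.0, 25.0));   // NaN
  }
}
{code}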

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check whether these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the diff between the calculations doesn't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14007) Difference response format for percentile aggregation

2020-02-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031758#comment-17031758
 ] 

Munendra S N commented on SOLR-14007:
-

[~ysee...@gmail.com] WDYT?

> Difference response format for percentile aggregation
> -
>
> Key: SOLR-14007
> URL: https://issues.apache.org/jira/browse/SOLR-14007
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Munendra S N
>Priority: Major
>
> For percentiles, the Stats component's response format is a {{NamedList}}, 
> but in JSON facets the format is either an array or a single value depending 
> on the number of percentiles specified.
> Even if the JSON facet percentile doesn't use a NamedList, the response 
> format shouldn't change based on the number of percentiles.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031765#comment-17031765
 ] 

Dawid Weiss commented on LUCENE-9201:
-

Please leave a patch or pull request. I will review and provide feedback but no 
earlier than Sunday/Monday.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-06 Thread GitBox
markharwood commented on a change in pull request #1234: Add compression for 
Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r375974370
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter  implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void  addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if(numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+data.writeVInt(numDocsInCurrentBlock);
+for (int i = 0; i < numDocsInCurrentBlock; i++) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@

[jira] [Commented] (SOLR-12930) Add developer documentation to source repo

2020-02-06 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031831#comment-17031831
 ] 

Cassandra Targett commented on SOLR-12930:
--

Sorry [~jpountz]. You're correct, they should not be included in binary 
artifacts. I don't know how to exclude them in either the Ant or Gradle builds, 
though.

The only thing I really know how to do here would be to revert the whole thing 
and someone else could take a stab at doing this some other time.

> Add developer documentation to source repo
> --
>
> Key: SOLR-12930
> URL: https://issues.apache.org/jira/browse/SOLR-12930
> Project: Solr
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Mark Miller
>Priority: Major
> Attachments: solr-dev-docs.zip
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob edited a comment on issue #1217: SOLR-14223 PublicKeyHandler consumes a lot of entropy during tests

2020-02-06 Thread GitBox
madrob edited a comment on issue #1217: SOLR-14223 PublicKeyHandler consumes a 
lot of entropy during tests
URL: https://github.com/apache/lucene-solr/pull/1217#issuecomment-582087251
 
 
   Wired this up so that we can get the keys loaded from disk - this setup 
seems to work for tests in `core` but not in `solrj` or `contrib` modules 
because they have different test sources? Should I copy the keys to each 
module, or is there a more elegant way to handle that?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14223) PublicKeyHandler consumes a lot of entropy during tests

2020-02-06 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031847#comment-17031847
 ] 

Mike Drob commented on SOLR-14223:
--

Based on [~rcmuir]'s suggestions, we can read a key from disk instead of 
generating a new one each time in our tests - this has the advantage of 
skipping the entropy consumption and also the expensive primality testing. PR 
is ready for final review if anybody is interested in taking a look.
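
Not the Solr patch itself, but a minimal JDK-only sketch of the idea: generate 
the RSA key pair once, store the encoded keys, and let tests load them back 
without touching SecureRandom or primality testing. The file names here are 
made up.

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyFactory;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.security.spec.X509EncodedKeySpec;

class TestKeySketch {
  // run once (offline) and commit the two files as test resources
  static KeyPair generateAndStore(Path dir) throws Exception {
    KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
    gen.initialize(2048); // uses SecureRandom + primality tests: slow, entropy hungry
    KeyPair kp = gen.generateKeyPair();
    Files.write(dir.resolve("test_pub.der"), kp.getPublic().getEncoded());
    Files.write(dir.resolve("test_priv.der"), kp.getPrivate().getEncoded());
    return kp;
  }

  // what tests would do instead of generating: no SecureRandom, no prime search
  static KeyPair load(Path dir) throws Exception {
    KeyFactory kf = KeyFactory.getInstance("RSA");
    PublicKey pub = kf.generatePublic(
        new X509EncodedKeySpec(Files.readAllBytes(dir.resolve("test_pub.der"))));
    PrivateKey priv = kf.generatePrivate(
        new PKCS8EncodedKeySpec(Files.readAllBytes(dir.resolve("test_priv.der"))));
    return new KeyPair(pub, priv);
  }
}
{code}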

> PublicKeyHandler consumes a lot of entropy during tests
> ---
>
> Key: SOLR-14223
> URL: https://issues.apache.org/jira/browse/SOLR-14223
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.4, 8.0
>Reporter: Mike Drob
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> After the changes in SOLR-12354 to eagerly create a {{PublicKeyHandler}} for 
> the CoreContainer, the creation of the underlying {{RSAKeyPair}} uses 
> {{SecureRandom}} to generate primes. This eats up a lot of system entropy and 
> can slow down tests significantly (I observed it adding 10s to an individual 
> test).
> Similar to what we do for SSL config for tests, we can swap in a non blocking 
> implementation of SecureRandom for the key pair generation to allow multiple 
> tests to run better in parallel. Primality testing with BigInteger is also 
> slow, so I'm not sure how much total speedup we can get here, maybe it's 
> worth checking if there are faster implementations out there in other 
> libraries.
> In production cases, this also blocks creation of all cores. We should only 
> create the Handler if necessary, i.e. if the existing authn/z tell us that 
> they won't support internode requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14219) OverseerSolrResponse's serialVersionUID has changed

2020-02-06 Thread Tomas Eduardo Fernandez Lobbe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Eduardo Fernandez Lobbe updated SOLR-14219:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> OverseerSolrResponse's serialVersionUID has changed
> ---
>
> Key: SOLR-14219
> URL: https://issues.apache.org/jira/browse/SOLR-14219
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.5
>Reporter: Andy Webb
>Assignee: Tomas Eduardo Fernandez Lobbe
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When the {{useUnsafeOverseerResponse=true}} option introduced in SOLR-14095 
> is used, the serialized OverseerSolrResponse has a different serialVersionUID 
> to earlier versions, making it backwards-incompatible.
> https://github.com/apache/lucene-solr/pull/1210 forces the serialVersionUID 
> to its old value, so old and new nodes become compatible.
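
For readers unfamiliar with the mechanism, this is roughly what pinning looks 
like (a generic sketch, not the actual class, and not the real value from PR 
#1210): an explicit serialVersionUID overrides the compiler-derived one, so 
nodes built from different versions of the class can still exchange serialized 
instances.

{code:java}
import java.io.Serializable;

class ResponseSketch implements Serializable {
  // Declaring the field explicitly pins the stream version; without it, the
  // JVM derives a value from the class shape, and any structural change (or a
  // different compiler) can break deserialization across nodes.
  private static final long serialVersionUID = 1L; // placeholder, not the real value

  // fields...
}
{code}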



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031858#comment-17031858
 ] 

Robert Muir commented on LUCENE-9201:
-

[~tomoko] has a pull request for this issue already: 
https://github.com/apache/lucene-solr/pull/1242

I will do some investigation too. Maybe the actual javadocs / python scripts 
side can be cleaned up to make this less painful.
For example, it is not good that some python checks can only parse the old 
javadoc format that was removed in java 13.
So even with the current ant build, things are not in great shape.
Ideally java's doclint would be leaned on more (e.g. enable the html check, 
remove the jtidy crap). That would make builds faster and reduce maintenance. 
Gotta at least try :)

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9209:
---

 Summary: fix javadocs to be html5, enable doclint html checks, 
remove jtidy
 Key: LUCENE-9209
 URL: https://issues.apache.org/jira/browse/LUCENE-9209
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


Currently doclint is very angry about all the {{<tt>}} elements and similar 
stuff going on. We claim to be emitting html5 documentation so it is about time 
to clean it up.

Then the html check can simply be enabled and we can remove the jtidy stuff 
completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031876#comment-17031876
 ] 

Robert Muir commented on LUCENE-9209:
-

I'm working on some of the common issues (such as tt tag -> code tag, table 
summary attribute -> caption element, etc) 

> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob merged pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize

2020-02-06 Thread GitBox
madrob merged pull request #1184: LUCENE-9142 Refactor IntSet operations for 
determinize
URL: https://github.com/apache/lucene-solr/pull/1184
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031889#comment-17031889
 ] 

ASF subversion and git services commented on LUCENE-9142:
-

Commit abd282d258d23d19b7f7c1e96332a19fa7b7b827 in lucene-solr's branch 
refs/heads/master from Mike
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=abd282d ]

LUCENE-9142 Refactor IntSet operations for determinize (#1184)

* LUCENE-9142 Refactor SortedIntSet for equality

Split SortedIntSet into a class heirarchy to make comparisons to
FrozenIntSet more meaningful. Use Arrays.equals for more efficient
comparison. Add tests for IntSet to verify correctness.

> Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
> 
>
> Key: LUCENE-9142
> URL: https://issues.apache.org/jira/browse/LUCENE-9142
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out 
> that we have mismatched types when trying to reuse states, and so we may be 
> creating more states than we need to.
> Relevant snippets:
> {code:title=Operations.java}
> Map newstate = new HashMap<>();
> final SortedIntSet statesSet = new SortedIntSet(5);
> Integer q = newstate.get(statesSet);
> {code}
> {{q}} is always going to be null in this path because there are no 
> SortedIntSet keys in the map.
> There is also very little javadoc on SortedIntSet, so I'm having trouble 
> following the precise relationship between all the pieces here.
> cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have 
> them
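
A generic illustration of the pitfall described above (stand-in classes, not 
the Lucene ones): if the class used for lookups never equals the class of the 
keys actually stored, {{Map.get}} always returns null and the reuse branch is 
dead code.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class KeyTypeMismatchSketch {
  // stand-in for the mutable "scratch" set used while building states
  static final class ScratchSet {
    final int[] values;
    ScratchSet(int... values) { this.values = values; }
    @Override public boolean equals(Object o) {
      return o instanceof ScratchSet && Arrays.equals(values, ((ScratchSet) o).values);
    }
    @Override public int hashCode() { return Arrays.hashCode(values); }
  }

  // stand-in for the frozen copies that actually end up as map keys
  static final class FrozenSet {
    final int[] values;
    FrozenSet(int... values) { this.values = values; }
    @Override public boolean equals(Object o) {
      return o instanceof FrozenSet && Arrays.equals(values, ((FrozenSet) o).values);
    }
    @Override public int hashCode() { return Arrays.hashCode(values); }
  }

  public static void main(String[] args) {
    Map<Object, Integer> newstate = new HashMap<>();
    newstate.put(new FrozenSet(1, 2, 3), 7);            // only frozen keys are stored
    Integer q = newstate.get(new ScratchSet(1, 2, 3));  // lookup uses the other type
    System.out.println(q); // null -- same contents, but equals() never matches
  }
}
{code}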



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031890#comment-17031890
 ] 

ASF subversion and git services commented on LUCENE-9142:
-

Commit abd282d258d23d19b7f7c1e96332a19fa7b7b827 in lucene-solr's branch 
refs/heads/master from Mike
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=abd282d ]

LUCENE-9142 Refactor IntSet operations for determinize (#1184)

* LUCENE-9142 Refactor SortedIntSet for equality

Split SortedIntSet into a class heirarchy to make comparisons to
FrozenIntSet more meaningful. Use Arrays.equals for more efficient
comparison. Add tests for IntSet to verify correctness.

> Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
> 
>
> Key: LUCENE-9142
> URL: https://issues.apache.org/jira/browse/LUCENE-9142
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out 
> that we have mismatched types when trying to reuse states, and so we may be 
> creating more states than we need to.
> Relevant snippets:
> {code:title=Operations.java}
> Map newstate = new HashMap<>();
> final SortedIntSet statesSet = new SortedIntSet(5);
> Integer q = newstate.get(statesSet);
> {code}
> {{q}} is always going to be null in this path because there are no 
> SortedIntSet keys in the map.
> There is also very little javadoc on SortedIntSet, so I'm having trouble 
> following the precise relationship between all the pieces here.
> cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have 
> them



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet

2020-02-06 Thread Mike Drob (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob resolved LUCENE-9142.
---
Fix Version/s: master (9.0)
 Assignee: Mike Drob
   Resolution: Fixed

> Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
> 
>
> Key: LUCENE-9142
> URL: https://issues.apache.org/jira/browse/LUCENE-9142
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out 
> that we have mismatched types when trying to reuse states, and so we may be 
> creating more states than we need to.
> Relevant snippets:
> {code:title=Operations.java}
> Map newstate = new HashMap<>();
> final SortedIntSet statesSet = new SortedIntSet(5);
> Integer q = newstate.get(statesSet);
> {code}
> {{q}} is always going to be null in this path because there are no 
> SortedIntSet keys in the map.
> There is also very little javadoc on SortedIntSet, so I'm having trouble 
> following the precise relationship between all the pieces here.
> cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have 
> them



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14247) IndexSizeTriggerMixedBoundsTest does a lot of sleeping

2020-02-06 Thread Mike Drob (Jira)
Mike Drob created SOLR-14247:


 Summary: IndexSizeTriggerMixedBoundsTest does a lot of sleeping
 Key: SOLR-14247
 URL: https://issues.apache.org/jira/browse/SOLR-14247
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Reporter: Mike Drob


When I run tests locally, the slowest reported test is always 
IndexSizeTriggerMixedBoundsTest  coming in at around 2 minutes.

I took a look at the code and discovered that at least 80s of that is all 
sleeps!

There might need to be more synchronization and ordering added back in, but 
when I removed all of the sleeps the test still passed locally for me, so I'm 
not too sure what the point was or why we were slowing the system down so much.
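
Not from this test, but the usual pattern when removing fixed sleeps is to poll 
for the condition with a timeout, so the test proceeds as soon as the condition 
holds. A minimal sketch:

{code:java}
import java.util.function.BooleanSupplier;

class WaitForSketch {
  static void waitFor(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(100); // short poll interval instead of one long fixed sleep
    }
  }
}
{code}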



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14162) TestInjection can leak Timer objects

2020-02-06 Thread Mike Drob (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob resolved SOLR-14162.
--
Fix Version/s: master (9.0)
   Resolution: Fixed

> TestInjection can leak Timer objects
> 
>
> Key: SOLR-14162
> URL: https://issues.apache.org/jira/browse/SOLR-14162
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Reporter: Mike Drob
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In TestInjection we track all of the outstanding timers for shutdown but try 
> to clean up based on the TimerTask instead of the Timer itself.
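
The distinction involved, as a minimal java.util.Timer sketch (not the 
TestInjection code): cancelling the TimerTask only prevents that task from 
running, while cancelling the Timer is what actually stops its background 
thread.

{code:java}
import java.util.Timer;
import java.util.TimerTask;

class TimerCleanupSketch {
  static void demo() {
    Timer timer = new Timer("injection-timer", true);
    TimerTask task = new TimerTask() {
      @Override public void run() { /* injected delay or fault */ }
    };
    timer.schedule(task, 1000L);

    task.cancel();  // the task won't fire, but the Timer thread is still alive
    timer.cancel(); // this call is what actually shuts the Timer down
  }
}
{code}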



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032021#comment-17032021
 ] 

Erick Erickson commented on LUCENE-9209:


Are you including Solr subtree? If not, I can grab that part.

> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-02-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031461#comment-17031461
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 2/7/20 1:46 AM:


??You don't share your test code, but I suspect you open new IndexReader every 
time you issue a query???

[~tomoko] The test code can be found in 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java].
 Yes, I opened a new reader for each query in the hope that IVFFlat and HNSW 
would be compared under fair conditions, since IVFFlat does not have a cache. I 
now realize this may lead to OOM, so I replaced it with a shared IndexReader 
and the problem is resolved.

 

Update – Top 1 in-set (query vector is in the candidate data set) recall 
results on SIFT1M data set ([http://corpus-texmex.irisa.fr/]) of IVFFlat and 
HNSW are as follows,

IVFFlat (no cache, reuse IndexReader)

 
||nprobe||avg. search time (ms)||recall percent (%)||
|8|13.3165|64.8|
|16|13.968|79.65|
|32|16.951|89.3|
|64|21.631|95.6|
|128|31.633|98.8|

 

HNSW (static cache, reuse IndexReader)
||avg. search time (ms)||recall percent (%)||
|6.3|{color:#ff}20.45{color}|

It can readily be seen that HNSW performs much better in query time. But I was 
surprised that the top 1 in-set recall of HNSW is so low. It shouldn't be a 
problem of the algorithm itself, but more likely a problem of the 
implementation or the test code. I will check it this weekend.
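
For reference, a sketch of how a top-1 in-set recall number like the ones above 
can be computed; the {{VectorSearcher}} interface is just a placeholder here, 
not the API used by the linked test class:

{code:java}
import java.util.List;

interface VectorSearcher {
  // ids of the k nearest indexed vectors for the given query vector
  List<Integer> search(float[] query, int k);
}

class Top1RecallSketch {
  // every query vector is itself one of the indexed vectors, so the ideal
  // top-1 hit is the vector's own id; recall@1 is the fraction of queries
  // where the implementation actually returns that id first
  static double top1InSetRecallPercent(VectorSearcher searcher, float[][] indexed) {
    int hits = 0;
    for (int id = 0; id < indexed.length; id++) {
      List<Integer> top = searcher.search(indexed[id], 1);
      if (!top.isEmpty() && top.get(0) == id) {
        hits++;
      }
    }
    return 100.0 * hits / indexed.length; // percent, as in the tables above
  }
}
{code}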

 


was (Author: irvingzhang):
??You don't share your test code, but I suspect you open new IndexReader every 
time you issue a query???

[~tomoko] The test code can be found in 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java].
 Yes, I opened a new reader for each query in hope that IVFFlat and HNSW are 
compared in a fair condition since IVFFlat do not have cache. I now realize it 
may lead to OOM, hence replacing with a shard IndexReader and the problem 
resolved.

 

Update -- Top 1 in-set (query vector is in the candidate data set) recall 
results on SIFT1M data set ([http://corpus-texmex.irisa.fr/]) of IVFFlat and 
HNSW are as follows,

IVFFlat (no cache, reuse IndexReader)

 
||nprobe||avg. search time (ms)||recall percent (%)||
|8|13.3165|64.8|
|16|13.968|79.65|
|32|16.951|89.3|
|64|21.631|95.6|
|128|31.633|98.8|

 

HNSW (static cache, reuse IndexReader)
||avg. search time (ms)||recall percent (%)||
|6.3|{color:#FF}20.45{color}|

It can readily be shown that HNSW performs much better in query time. But I was 
surprised that top 1 in-set recall percent of HNSW is so low. It shouldn't be a 
problem of algorithm itself, but more likely a problem of implementation or 
test code. I will check it this weekend.

 

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The i

[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032061#comment-17032061
 ] 

Robert Muir commented on LUCENE-9209:
-

Yes, currently I am fixing the solr parts too (though I am not yet sure what I 
am walking into, so I may need help; lemme see if I can get lucene-core working 
first).

I look at each violation in lucene-core and then fix it across the entire 
source tree.
For example table "summary" attribute is complained about. I look up 
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table and see that it 
says:
{quote}
This attribute defines an alternative text that summarizes the content of the 
table. Use the <caption> element instead.
{quote}

So then I fix all occurrences in the whole source tree of table "summary" 
attribute by transforming into a caption, like this:
{noformat}
- * <table summary="comparison of dictionary and hyphenation based decompounding">
+ * <table>
+ * <caption>comparison of dictionary and hyphenation based 
decompounding</caption>
{noformat}

I'll upload a patch with my current state. I still have 16 violations in 
lucene-core.

> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9209:

Attachment: LUCENE-9209_current_state.patch

> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9209_current_state.patch
>
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9210) gradle javadocs doesn't incorporate CSS/JS

2020-02-06 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9210:
---

 Summary: gradle javadocs doesn't incorporate CSS/JS
 Key: LUCENE-9210
 URL: https://issues.apache.org/jira/browse/LUCENE-9210
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


We add to the javadoc css/jss some stuff:

* Prettify.css/js (syntax highlighting)
* a few styles to migrate table cellpadding: LUCENE-9209

The ant task concatenates the stuff to the end of the resulting javadocs css/js.

We should either do this also in the gradle build or remove our reliance on 
this stuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9209:

Attachment: LUCENE-9209.patch

> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch
>
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032077#comment-17032077
 ] 

Robert Muir commented on LUCENE-9209:
-

patch attached: javadocs succeeds for all of lucene and solr. The html doclint 
option is enabled for gradle and ant, and all jtidy stuff is removed.

 I plan to commit this soon as it is about as boring as it gets, but will 
conflict with everything.

> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch
>
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032088#comment-17032088
 ] 

ASF subversion and git services commented on LUCENE-9209:
-

Commit 0d339043e378d8333c376bae89411b813de25b10 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0d33904 ]

LUCENE-9209: fix javadocs to be html5, enable doclint html checks, remove jtidy

Current javadocs declare an HTML5 doctype: !DOCTYPE HTML. Some HTML5
features are used, but unfortunately also some constructs that do not
exist in HTML5 are used as well.

Because of this, we have no checking of any html syntax. jtidy is
disabled because it works with html4. doclint is disabled because it
works with html5. our docs are neither.

javadoc "doclint" feature can efficiently check that the html isn't
crazy. we just have to fix really ancient removed/deprecated stuff
(such as use of tt tag).

This enables the html checking in both ant and gradle. The docs are
fixed via straightforward transformations.

One exception is table cellpadding, for this some helper CSS classes
were added to make the transition easier (since it must apply padding
to inner th/td, not possible inline). I added TODOs, we should clean
this up. Most problems look like they may have been generated from a
GUI or similar and not a human.


> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch
>
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9209) fix javadocs to be html5, enable doclint html checks, remove jtidy

2020-02-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032092#comment-17032092
 ] 

ASF subversion and git services commented on LUCENE-9209:
-

Commit 860115e4502175934bb1d7ae90f8bda65c464bb9 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=860115e ]

LUCENE-9209: revert changes to test html file, not intended


> fix javadocs to be html5, enable doclint html checks, remove jtidy
> --
>
> Key: LUCENE-9209
> URL: https://issues.apache.org/jira/browse/LUCENE-9209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9209.patch, LUCENE-9209_current_state.patch
>
>
> Currently doclint is very angry about all the {{<tt>}} elements and similar 
> stuff going on. We claim to be emitting html5 documentation so it is about 
> time to clean it up.
> Then the html check can simply be enabled and we can remove the jtidy stuff 
> completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032100#comment-17032100
 ] 

Robert Muir commented on LUCENE-9201:
-

{quote}
Package summary: "ant documentation" uses "package.html" as package summary 
description, but "gradlew javadoc" ignores "package.html" (so some packages 
lacks summary description in "package-summary.html" when building javadocs by 
Gradle). We might be able to make Gradle Javadoc task to properly handle 
"package.html" files with some options. Or, should we replace all 
"package.html" with "package-info.java" at this time?
{quote}

It is a stupidly complex issue. The problem here also exists in the ant build. 
The underlying issue is that java 8 produced HTML4 by default, while java 9+ 
generates HTML5 even though its manual page implies that HTML4 is still the 
default. This kind of shit makes our build too complicated. 

Possibly in branch 8.x we can simply pass the "-html4" option to the javadoc 
tool so that it always generates html4 output, even if you happen to use java 
9, 10, 11, 12, 13, etc. to invoke it. Currently, if you use java 9+ on this 
branch, javadocs are messed up: you get no overview.html, etc. Forcing html4 
for this branch seems like the best way. I will investigate for sure.

On the other hand, things are easier on the master branch: java 11 is the 
minimum requirement, so let's not fight the default (which is HTML5). Maybe I 
am wrong here; if we change our minds we can just revert my commits and wire in 
HTML4. I fixed the syntax issues already in LUCENE-9209. We must get the 
overviews working again, so I am not sure if {{package-info.java}} is the best 
solution (it addresses package.html, but what about overview.html?)

Sorry for the long answer, but at least it is no mystery.
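
For the package.html half of the question, the package-info.java equivalent 
would look roughly like this (hypothetical package name, for illustration 
only); it covers the package summary but, as noted above, not overview.html:

{code:java}
/**
 * This Javadoc comment becomes the package summary shown in
 * package-summary.html, replacing the text that previously lived in the
 * package.html file of the same package.
 */
package org.apache.lucene.example; // hypothetical package, for illustration
{code}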

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032103#comment-17032103
 ] 

Robert Muir commented on LUCENE-9201:
-

With java 13+, the {{-html4}} option is removed. So we are forced to adopt 
html5. I feel a little better :)
Fixing branch_8x is hopeless: java 8 can only generate html4 and java 13 can 
only generate html5.

So now we should focus on fixing package.html/overview.html in the master 
branch. I will look into it a bit.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org