[GitHub] [lucene-jira-archive] mocobeta commented on pull request #139: Enable GitHub Pages for hosting attachment files.

2022-08-09 Thread GitBox


mocobeta commented on PR #139:
URL: 
https://github.com/apache/lucene-jira-archive/pull/139#issuecomment-1209022003

   The problem was solved by moving `attachments` to the `/docs` folder 
(https://github.com/apache/lucene-jira-archive/commit/f5d8d00244495a35c9998aaf91b3021832ffada3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on pull request #139: Enable GitHub Pages for hosting attachment files.

2022-08-09 Thread GitBox


mocobeta commented on PR #139:
URL: 
https://github.com/apache/lucene-jira-archive/pull/139#issuecomment-1209032120

   The attachments' base URL is `https://apache.github.io/lucene-jira-archive`.
   e.g.
   
https://apache.github.io/lucene-jira-archive/attachments/LUCENE-10006/mypatch.patch
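The URL layout in the example above (base URL, then `attachments`, the issue key, and the file name) can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical, and the path layout is inferred from the single example URL in this comment.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper; the path layout is inferred from the example URL in the comment. */
final class AttachmentUrls {
  private static final String HOST = "apache.github.io";
  private static final String BASE_PATH = "/lucene-jira-archive/attachments/";

  /** Builds the GitHub Pages URL for one attachment of one Jira issue. */
  static String attachmentUrl(String issueKey, String fileName) {
    try {
      // The multi-argument URI constructor percent-encodes characters
      // that are illegal in a path, such as spaces in file names.
      URI uri = new URI("https", HOST, BASE_PATH + issueKey + "/" + fileName, null);
      return uri.toASCIIString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Cannot build attachment URL", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(attachmentUrl("LUCENE-10006", "mypatch.patch"));
  }
}
```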





[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #141: use GitHub Pages instead of raw.githubusercontent.com

2022-08-09 Thread GitBox


mocobeta opened a new pull request, #141:
URL: https://github.com/apache/lucene-jira-archive/pull/141

   I moved all attachment files from the `attachments` branch to the 
`gh-pages` branch and enabled GitHub Pages for this repo. 
https://github.com/apache/lucene-jira-archive/pull/139
   Now all attachments can be accessed via 
`https://apache.github.io/lucene-jira-archive/`.





[GitHub] [lucene-jira-archive] mocobeta merged pull request #141: use GitHub Pages instead of raw.githubusercontent.com

2022-08-09 Thread GitBox


mocobeta merged PR #141:
URL: https://github.com/apache/lucene-jira-archive/pull/141





[GitHub] [lucene-jira-archive] mocobeta commented on issue #127: Consider using GitHub Pages for attachments rather than raw.githubusercontent.com

2022-08-09 Thread GitBox


mocobeta commented on issue #127:
URL: 
https://github.com/apache/lucene-jira-archive/issues/127#issuecomment-1209063006

   I had a bit of trouble correctly enabling GitHub Pages in this repo, 
but have finally done it.
   Now attachments are served via GitHub Pages instead of 
raw.githubusercontent.com.
   https://github.com/mocobeta/migration-test-3/issues/594
   
   Many thanks to @vlsi for letting us know about this.





[GitHub] [lucene] visionarywind opened a new pull request, #1063: help scorer scan of memory codec

2022-08-09 Thread GitBox


visionarywind opened a new pull request, #1063:
URL: https://github.com/apache/lucene/pull/1063

   cut off useless search for scorer scan
   





[jira] [Commented] (LUCENE-10676) FieldInfo#name contributes significantly to heap usage at scale

2022-08-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577304#comment-17577304
 ] 

Robert Muir commented on LUCENE-10676:
--

I'm opposed to the use of String.intern here.

The problem here is the user: they have 10,000 fields, as you described. That's 
completely unnecessary.

> FieldInfo#name contributes significantly to heap usage at scale
> ---
>
> Key: LUCENE-10676
> URL: https://issues.apache.org/jira/browse/LUCENE-10676
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 9.3
> Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but 
> seems independent of environment.
>Reporter: David Turner
>Priority: Minor
>  Labels: heap, scalability
> Attachments: image-2022-08-08-13-23-37-050.png
>
>
> We encountered an Elasticsearch user with high heap usage, a significant 
> proportion of which was down to the contents of `FieldInfo#name`.
> This user was certainly pushing some scalability boundaries: this single 
> process had thousands of active Lucene indices, many with 10k+ fields, and 
> many indices had hundreds of segments due to an excess of flushes, so in 
> total they had an enormous number of `FieldInfo` instances. Still, the bulk 
> of the heap usage was just field names, and the total number of distinct 
> field names was fairly small. That's pretty common, especially for time-based 
> data like logs. Some kind of interning or deduplication of these strings 
> would have reduced their heap usage by many GBs.
> Is there a way we could deduplicate these strings? Deduplicating them across 
> segments within each index would already have helped, but ideally we'd like 
> to deduplicate them across indices too.
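To make the request concrete, here is a minimal sketch of what "deduplicating" equal strings can mean without calling `String.intern`: a pool that hands back one canonical instance per distinct value. The `StringPool` class is hypothetical and is not part of Lucene; it also keeps strong references forever, so a real design would have to decide how entries are evicted.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical application-level pool, shown for illustration only; not a Lucene API. */
final class StringPool {
  private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

  /** Returns the canonical instance for an equal string, registering s if none exists yet. */
  String dedup(String s) {
    String existing = pool.putIfAbsent(s, s);
    return existing != null ? existing : s;
  }

  /** Number of distinct strings currently held by the pool. */
  int size() {
    return pool.size();
  }

  public static void main(String[] args) {
    StringPool pool = new StringPool();
    // Two equal but distinct String objects, as if read from two segments.
    String a = new String("message.timestamp");
    String b = new String("message.timestamp");
    System.out.println(pool.dedup(a) == pool.dedup(b));
  }
}
```

Sharing one pool per index would deduplicate across segments; sharing one per process would deduplicate across indices, at the cost of a longer-lived global structure.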



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on issue #4: Which GitHub account we should/can use for migration?

2022-08-09 Thread GitBox


mocobeta commented on issue #4:
URL: 
https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1209152955

   Hi,
   here is the INFRA issue.
   https://issues.apache.org/jira/browse/INFRA-23563
   
   Can you please watch the issue, and give comments if needed? I tried to 
explain our intricate requests but am not confident that I'm doing it well.





[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577309#comment-17577309
 ] 

Robert Muir commented on LUCENE-10677:
--

I'm opposed to the use of String.intern by the Lucene library here. It is 
inappropriate for a library (versus an app); there are plenty of discussions 
you can find about the problems it causes for apps.

If someone insists on having thousands of indexes with tens of thousands of 
fields, they can buy more RAM.

> Duplicate strings in FieldInfo#attributes contribute significantly to heap 
> usage at scale
> -
>
> Key: LUCENE-10677
> URL: https://issues.apache.org/jira/browse/LUCENE-10677
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 9.3
>Reporter: Armin Braun
>Priority: Minor
>  Labels: heap, scalability
> Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676 . Running a single process 
> with thousands of fields across many indexes will lead to a lot of duplicate 
> strings retained as keys and values in the `attributes` map. This can amount 
> to GBs of heap for thousands of fields across a few thousand segments. The 
> strings in the heap dump analysis below account for more than half of the 
> duplicate strings from `FieldInfo` instances (roughly 2/3; the field names 
> are somewhat unusually long in this example).
> If we could deduplicate these obvious known strings when reading `FieldInfo` 
> we could save GBs of heap for use cases like this.
>  






[GitHub] [lucene] msokolov commented on pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-09 Thread GitBox


msokolov commented on PR #1054:
URL: https://github.com/apache/lucene/pull/1054#issuecomment-1209274547

   I'll push to main later today if I don't see any further discussion, and let 
it percolate for a bit before backporting to 9.x





[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-09 Thread Armin Braun (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577393#comment-17577393
 ] 

Armin Braun commented on LUCENE-10677:
--

This wouldn't necessarily need string interning here. Looking at the real world 
examples I have of this, simply deduplicating a few known strings like 
"PerFieldPostingsFormat.format" would already be a huge memory saving for this 
map. Couldn't we just special case some known strings when deserializing that 
map to deal with the biggest offenders?

It's not just about RAM outright, saving the GC for these strings would be 
quite helpful as well, especially when a lot of these eventually become only 
weakly referenced through a chain from the segment readers which makes it hard 
to quickly collect them under heap pressure (which is what caused trouble in 
the case that motivated this).
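The "special case some known strings when deserializing" idea could look roughly like the sketch below. Everything here is an assumption for illustration: the `KnownAttributeStrings` class is hypothetical, the table contains only the one string quoted in this comment, and a plain `Map<String, String>` stands in for whatever the codec actually reads.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the Lucene implementation. */
final class KnownAttributeStrings {
  // Canonical instances of well-known strings. Only the string quoted in the
  // comment is listed; a real table would need the other frequent offenders.
  private static final Map<String, String> CANONICAL = new HashMap<>();

  static {
    register("PerFieldPostingsFormat.format");
  }

  private static void register(String s) {
    CANONICAL.put(s, s);
  }

  /** Returns the shared instance for a known string, or the input itself otherwise. */
  static String canonicalize(String s) {
    return CANONICAL.getOrDefault(s, s);
  }

  /** Copies a freshly deserialized map, swapping known keys and values for shared instances. */
  static Map<String, String> canonicalizeMap(Map<String, String> raw) {
    Map<String, String> result = new HashMap<>(raw.size() * 2);
    for (Map.Entry<String, String> e : raw.entrySet()) {
      result.put(canonicalize(e.getKey()), canonicalize(e.getValue()));
    }
    return result;
  }
}
```

Unknown strings pass through untouched, so the table only ever helps for the entries someone has listed in it.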




[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-09 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577408#comment-17577408
 ] 

Dawid Weiss commented on LUCENE-10677:
--

It would help a lot if you could provide an example of how you ended up with 25 
million FieldInfo objects that cannot be garbage collected. This is weird and 
unexpected, I'd say.




[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #142: attach type:task label to Sub-tasks

2022-08-09 Thread GitBox


mocobeta opened a new pull request, #142:
URL: https://github.com/apache/lucene-jira-archive/pull/142

   Sub-tasks should have `type:task` label.





[GitHub] [lucene-jira-archive] mocobeta merged pull request #142: attach type:task label to Sub-tasks

2022-08-09 Thread GitBox


mocobeta merged PR #142:
URL: https://github.com/apache/lucene-jira-archive/pull/142





[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-09 Thread Armin Braun (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577440#comment-17577440
 ] 

Armin Braun commented on LUCENE-10677:
--

[~dweiss] this happened on an Elasticsearch node that had ~150 indices that 
were actively indexed into. Each of those had about 2k fields and many of them 
ended up with ~100 segments which works out to about the number we're seeing 
since the `FieldInfo` objects seem to be duplicated across segments.

Even though we're admittedly dealing with a somewhat excessive number of fields 
here, it seems off that the strings from the attributes map are what's causing 
the biggest issue here performance-wise, doesn't it?
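The estimate in the first paragraph can be checked with simple arithmetic. The sketch below multiplies the three approximate figures from the comment; it is an upper bound, since only "many" of the indices reached about 100 segments, which fits the roughly 25 million objects mentioned earlier in the thread. The class name is hypothetical.

```java
/** Back-of-the-envelope estimate using the approximate figures quoted in the comment. */
final class FieldInfoEstimate {
  /** One FieldInfo per field, per segment, per index. */
  static long estimate(long indices, long fieldsPerIndex, long segmentsPerIndex) {
    return indices * fieldsPerIndex * segmentsPerIndex;
  }

  public static void main(String[] args) {
    // ~150 indices, ~2k fields each, ~100 segments each
    System.out.println(estimate(150, 2_000, 100)); // prints 30000000
  }
}
```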




[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-09 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577468#comment-17577468
 ] 

Dawid Weiss commented on LUCENE-10677:
--

String.intern is evil for many reasons and your use case is indeed, ahem, 
atypical. I don't think adding "a few known strings" is an elegant solution 
since hacks like this one tend to become stale quickly... You could try the 
JVM's UseStringDeduplication option - an ugly but easy workaround - but I 
think you'll run into other problems soon enough with this number of concurrent 
indices/segments/fields. If you have to live with this then it's likely that 
you'll have to follow Rob's advice sooner or later.
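For reference, the option is passed to HotSpot as `-XX:+UseStringDeduplication`. It was introduced for the G1 collector, so whether it has any effect depends on the JDK version and collector in use, and it deduplicates the backing arrays of equal strings rather than the `String` objects themselves. The sketch below is a small, hypothetical check of whether the running JVM was started with the flag; it inspects only the command-line arguments, not the collector.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper that reports whether the deduplication flag was passed to this JVM. */
final class StringDedupCheck {
  /** Scans JVM arguments in order; the last occurrence of the flag wins. */
  static boolean enabled(List<String> jvmArgs) {
    boolean on = false;
    for (String arg : jvmArgs) {
      if (arg.equals("-XX:+UseStringDeduplication")) {
        on = true;
      } else if (arg.equals("-XX:-UseStringDeduplication")) {
        on = false;
      }
    }
    return on;
  }

  public static void main(String[] args) {
    List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
    System.out.println("String deduplication requested: " + enabled(jvmArgs));
  }
}
```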




[GitHub] [lucene-jira-archive] mikemccand commented on issue #4: Which GitHub accont we should/can use for migration?

2022-08-09 Thread GitBox


mikemccand commented on issue #4:
URL: 
https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1209483962

   Thanks @mocobeta!





[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-09 Thread Armin Braun (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577484#comment-17577484
 ] 

Armin Braun commented on LUCENE-10677:
--

[~dweiss] maybe an alternative solution could be to promote known/common 
attributes to concrete fields in `FieldInfo` instead of using generic strings 
in a map? That would save memory on the attribute keys and not run the 
risk of becoming stale?
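A rough sketch of what "promoting" an attribute to a concrete field could mean is shown below. `FieldInfoSketch` is a hypothetical stand-in and not Lucene's `FieldInfo`; the promoted key is the one string named earlier in the thread. The key is stored once as a constant instead of once per instance, while the value is still stored per instance.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for illustration only; not Lucene's FieldInfo. */
final class FieldInfoSketch {
  // The key lives once in this constant instead of once per map entry.
  private static final String POSTINGS_FORMAT_KEY = "PerFieldPostingsFormat.format";

  private final String postingsFormat;               // promoted attribute
  private final Map<String, String> otherAttributes; // everything else stays generic

  FieldInfoSketch(Map<String, String> attributes) {
    Map<String, String> rest = new HashMap<>(attributes);
    this.postingsFormat = rest.remove(POSTINGS_FORMAT_KEY);
    this.otherAttributes = rest.isEmpty() ? Collections.emptyMap() : rest;
  }

  /** Keeps the generic lookup working for callers that still pass string keys. */
  String getAttribute(String key) {
    return POSTINGS_FORMAT_KEY.equals(key) ? postingsFormat : otherAttributes.get(key);
  }

  /** Number of attributes that still need a per-instance map entry. */
  int genericAttributeCount() {
    return otherAttributes.size();
  }
}
```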




[GitHub] [lucene] nknize merged pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-08-09 Thread GitBox


nknize merged PR #1017:
URL: https://github.com/apache/lucene/pull/1017





[jira] [Commented] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-08-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577561#comment-17577561
 ] 

ASF subversion and git services commented on LUCENE-10654:
--

Commit d7fd48c9502c567e4760a011fa99b1a491fea2cb in lucene's branch 
refs/heads/main from Nick Knize
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d7fd48c9502 ]

LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)

Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring 
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement. 

Signed-off-by: Nicholas Walter Knize 

> New companion doc value format for LatLonShape and XYShape field types
> --
>
> Key: LUCENE-10654
> URL: https://issues.apache.org/jira/browse/LUCENE-10654
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Nick Knize
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
> {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue 
> format. 
> This lack of doc value support for shapes means facets, aggregations, and 
> IndexOrDocValues queries are currently not possible for Shape field types. 
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations 
> and facets, the ability to compute the spatial relation with the doc value is 
> needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since 
> the doc value encoding is nothing more than a simple 2D integer encoding of 
> the x,y and lat,lon dimensional components. Accomplishing the same with a 
> naive integer encoded binary representation for N-vertex shapes would be 
> costly. 
> {{ComponentTree}} already provides an efficient in memory structure for 
> quickly computing spatial relations over Shape types based on a binary tree 
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
> tessellation is already computed at index time. If we create an on-disk 
> representation of {{ComponentTree}} 's binary tree of tessellated triangles 
> and use this as the doc value {{binaryValue}} format we will be able to 
> efficiently compute spatial relations with this binary representation and 
> achieve the same facet/aggregation result over shapes as we can with points 
> today (e.g., grid facets, centroid, area, etc).
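To illustrate why the point case is called straightforward, here is a minimal sketch of a "simple 2D integer encoding": each coordinate is quantized to an `int` and the two are packed into one `long`. The fixed 1e-7 degree quantization and the `LatLonPack` class are assumptions chosen for readability; this is not Lucene's actual encoding.

```java
/** Hypothetical point encoding for illustration; not the encoding Lucene uses. */
final class LatLonPack {
  private static final double SCALE = 1e7; // assumed resolution: 1e-7 degrees

  /** Quantizes both coordinates to ints and packs them as [lat | lon] in one long. */
  static long pack(double lat, double lon) {
    int latE7 = (int) Math.round(lat * SCALE); // |lat| <= 90 fits in an int
    int lonE7 = (int) Math.round(lon * SCALE); // |lon| <= 180 fits in an int
    return ((long) latE7 << 32) | (lonE7 & 0xFFFFFFFFL);
  }

  /** Recovers the latitude from the upper 32 bits. */
  static double unpackLat(long packed) {
    return (int) (packed >> 32) / SCALE;
  }

  /** Recovers the longitude from the lower 32 bits. */
  static double unpackLon(long packed) {
    return (int) packed / SCALE;
  }
}
```

A shape with N vertices has no such fixed-width form, which is the cost the description refers to.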






[GitHub] [lucene] nknize opened a new pull request, #1064: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)

2022-08-09 Thread GitBox


nknize opened a new pull request, #1064:
URL: https://github.com/apache/lucene/pull/1064

   Backport of #1017 to branch_9x
   
   Adds new doc value field to support LatLonShape and XYShape doc values. The
   implementation is inspired by ComponentTree. A binary tree of tessellated
   components (point, line, or triangle) is created. This tree is then DFS
   serialized to a variable compressed DataOutput buffer to keep the doc value
   format as compact as possible.
   
   DocValue queries are performed on the serialized tree using a similar component
   relation logic as found in SpatialQuery for BKD indexed shapes. To make this
   possible some of the relation logic is refactored to make it accessible to the
   doc value query counterpart.
   
   Note this does not support the following:
   
   * Multi Geometries or Collections - This will be investigated by exploring
     the addition of multi binary doc values.
   * General Geometry Queries - This will be added in a follow on improvement.
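The general technique described above, writing a binary tree depth-first into a compact byte buffer with variable-length integers, can be sketched as follows. This is a simplified illustration under stated assumptions, not the format used by the PR: the `TreeSerializer` class is hypothetical, each node carries a single `int` instead of a tessellated component, and a `ByteArrayOutputStream` stands in for Lucene's `DataOutput`.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of pre-order (DFS) tree serialization; not the PR's format. */
final class TreeSerializer {
  static final class Node {
    final int value;
    final Node left, right;

    Node(int value, Node left, Node right) {
      this.value = value;
      this.left = left;
      this.right = right;
    }
  }

  /** Writes 7 bits per byte, low bits first; the high bit marks "more bytes follow". */
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  /** Pre-order: a flags byte (bit 0 = has left, bit 1 = has right), the value, then the children. */
  static void write(Node n, ByteArrayOutputStream out) {
    int flags = (n.left != null ? 1 : 0) | (n.right != null ? 2 : 0);
    out.write(flags);
    writeVInt(out, n.value);
    if (n.left != null) write(n.left, out);
    if (n.right != null) write(n.right, out);
  }

  static byte[] serialize(Node root) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    write(root, out);
    return out.toByteArray();
  }

  static Node deserialize(byte[] bytes) {
    return read(bytes, new int[] {0});
  }

  // pos[0] is the shared read cursor into the byte array.
  private static Node read(byte[] b, int[] pos) {
    int flags = b[pos[0]++];
    int value = readVInt(b, pos);
    Node left = (flags & 1) != 0 ? read(b, pos) : null;
    Node right = (flags & 2) != 0 ? read(b, pos) : null;
    return new Node(value, left, right);
  }

  private static int readVInt(byte[] b, int[] pos) {
    int result = 0;
    int shift = 0;
    byte cur;
    do {
      cur = b[pos[0]++];
      result |= (cur & 0x7F) << shift;
      shift += 7;
    } while ((cur & 0x80) != 0);
    return result;
  }
}
```

Because children follow their parent directly in the buffer, a reader can walk the tree in one forward pass without materializing node objects, which is what makes running relation logic on the serialized form plausible.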
   





[GitHub] [lucene] nknize merged pull request #1064: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)

2022-08-09 Thread GitBox


nknize merged PR #1064:
URL: https://github.com/apache/lucene/pull/1064





[jira] [Commented] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-08-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577620#comment-17577620
 ] 

ASF subversion and git services commented on LUCENE-10654:
--

Commit ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 in lucene's branch 
refs/heads/branch_9x from Nick Knize
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ddf0d0acf4e ]

LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017) 
(#1064)

Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement.

Signed-off-by: Nicholas Walter Knize 

> New companion doc value format for LatLonShape and XYShape field types
> --
>
> Key: LUCENE-10654
> URL: https://issues.apache.org/jira/browse/LUCENE-10654
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Nick Knize
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}.
> {{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value
> format.
> This lack of doc value support for shapes means facets, aggregations, and
> IndexOrDocValues queries are currently not possible for Shape field types.
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations
> and facets, the ability to compute the spatial relation with the doc value is
> needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since
> the doc value encoding is nothing more than a simple 2D integer encoding of
> the x,y and lat,lon dimensional components. Accomplishing the same with a
> naive integer-encoded binary representation for N-vertex shapes would be
> costly.
> {{ComponentTree}} already provides an efficient in-memory structure for
> quickly computing spatial relations over Shape types based on a binary tree
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this
> tessellation is already computed at index time. If we create an on-disk
> representation of {{ComponentTree}}'s binary tree of tessellated triangles
> and use this as the doc value {{binaryValue}} format, we will be able to
> efficiently compute spatial relations with this binary representation and
> achieve the same facet/aggregation result over shapes as we can with points
> today (e.g., grid facets, centroid, area, etc.).
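
To make the "simple 2D integer encoding" for points mentioned in the description concrete, here is a minimal Python sketch that quantizes a latitude/longitude pair to two 32-bit integers and packs them into one 64-bit value. The scale factors, the clamping at the upper bound, and all function names are assumptions chosen for this illustration; it is not a copy of Lucene's encoding code.

```python
import math
from typing import Tuple

BITS = 32
LAT_SCALE = (1 << BITS) / 180.0  # integer steps per degree of latitude
LON_SCALE = (1 << BITS) / 360.0  # integer steps per degree of longitude
INT_MAX = (1 << (BITS - 1)) - 1
MASK = (1 << BITS) - 1


def encode_lat(lat: float) -> int:
    """Quantize a latitude in [-90, 90] to a signed 32-bit integer."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"invalid latitude {lat}")
    # Clamp so that +90 does not overflow the signed 32-bit range.
    return min(math.floor(lat * LAT_SCALE), INT_MAX)


def encode_lon(lon: float) -> int:
    """Quantize a longitude in [-180, 180] to a signed 32-bit integer."""
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"invalid longitude {lon}")
    return min(math.floor(lon * LON_SCALE), INT_MAX)


def pack(lat: float, lon: float) -> int:
    """Pack both encoded dimensions into a single 64-bit value (lat in the high bits)."""
    return ((encode_lat(lat) & MASK) << BITS) | (encode_lon(lon) & MASK)


def _signed(value: int) -> int:
    # Reinterpret an unsigned 32-bit pattern as a signed integer.
    return value - (1 << BITS) if value > INT_MAX else value


def unpack(packed: int) -> Tuple[float, float]:
    """Recover an approximate (lat, lon) pair from the packed value."""
    lat_bits = _signed(packed >> BITS)
    lon_bits = _signed(packed & MASK)
    return lat_bits / LAT_SCALE, lon_bits / LON_SCALE
```

A point therefore costs a fixed 8 bytes per document, which is what makes the relation check cheap for points. An N-vertex shape encoded the same way would grow linearly with N and would still need its triangles rebuilt at query time, which is the cost the description refers to.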



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[GitHub] [lucene-site] vigyasharma opened a new pull request, #68: Add vigya to committer list

2022-08-09 Thread GitBox


vigyasharma opened a new pull request, #68:
URL: https://github.com/apache/lucene-site/pull/68

   Add vigyasharma to list of committers.





[GitHub] [lucene-site] vigyasharma merged pull request #68: Add vigya to committer list

2022-08-09 Thread GitBox


vigyasharma merged PR #68:
URL: https://github.com/apache/lucene-site/pull/68





[GitHub] [lucene-site] vigyasharma opened a new pull request, #69: Add Vigya Sharma to whoweare

2022-08-09 Thread GitBox


vigyasharma opened a new pull request, #69:
URL: https://github.com/apache/lucene-site/pull/69

   Add Vigya Sharma to list of committers





[GitHub] [lucene-site] vigyasharma merged pull request #69: Add Vigya Sharma to whoweare

2022-08-09 Thread GitBox


vigyasharma merged PR #69:
URL: https://github.com/apache/lucene-site/pull/69





[GitHub] [lucene] vigyasharma merged pull request #948: Fix typo in PostingsReaderBase docstring

2022-08-09 Thread GitBox


vigyasharma merged PR #948:
URL: https://github.com/apache/lucene/pull/948





[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-08-09 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-10557:
---
Attachment: (was: lucene_duke_surf.png)

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Tomoko Uchida
> Assignee: Tomoko Uchida
> Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png,
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use GitHub issues instead
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it would be technically possible for us to move to GitHub issues.
> I have little knowledge of how to proceed with it, so I'd like to discuss
> whether we should migrate and, if so, how to handle the migration smoothly.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on the GitHub side,
>  *** add a comment/ prepend a link on the Jira side to the new issue on the
> GitHub side (for people who access Jira from blogs, mailing list archives and
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on the GitHub side with the
> original LUCENE-XYZ key so that it is easier to search for a particular issue
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have
> to be GitHub users? Will information about people not registered on GitHub be
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be
> rewritten, but that would change the entire git history tree - I don't think
> this is practical, though it is doable.
>  * Prepare a complete migration tool
>  ** See [https://github.com/apache/lucene-jira-archive/issues/5]
>  * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable GitHub issues on Lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)
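
A few of the "things to consider" above (converting cross-issue links, keeping the original LUCENE-XYZ key in the title, and producing an old-to-new URL mapping file) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the title suffix format, and the GitHub issue numbers passed in are hypothetical, and the target repository URL is assumed to be `apache/lucene`. It is not the migration script that was actually written.

```python
import re
from typing import Dict, List, Tuple

JIRA_BASE = "https://issues.apache.org/jira/browse/"
GITHUB_BASE = "https://github.com/apache/lucene/issues/"  # assumed target repository

# Matches Jira keys such as LUCENE-10557 as whole words.
ISSUE_KEY = re.compile(r"\bLUCENE-\d+\b")


def rewrite_issue_keys(text: str, key_to_number: Dict[str, int]) -> str:
    """Replace known Jira keys with a GitHub reference; leave unknown keys untouched."""

    def replace(match: "re.Match[str]") -> str:
        key = match.group(0)
        number = key_to_number.get(key)
        return key if number is None else f"#{number} ({key})"

    return ISSUE_KEY.sub(replace, text)


def migrated_title(key: str, title: str) -> str:
    """Postfix the original key so the issue stays searchable by its Jira id."""
    return f"{title} [{key}]"


def url_mapping(key_to_number: Dict[str, int]) -> List[Tuple[str, str]]:
    """Rows of (old Jira URL, new GitHub URL) suitable for writing to a mapping file."""
    return [
        (JIRA_BASE + key, GITHUB_BASE + str(number))
        for key, number in sorted(key_to_number.items())
    ]
```

Keeping the Jira key in the rewritten text, rather than replacing it outright, means that a search for the old key still finds the migrated comment.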






[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-08-09 Thread Michael Wechner (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577811#comment-17577811
 ] 

Michael Wechner commented on LUCENE-10471:
--

Maybe I do not understand the Lucene code base well enough, but wouldn't it
be possible to have a default limit of 1024 or 2048 and let users set a
different limit programmatically on the IndexWriter/Reader/Searcher?
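
The suggestion can be pictured with a small Python sketch: a default maximum that applies unless the caller overrides it. The class and parameter names below are hypothetical and do not correspond to any Lucene API; the default of 1024 is the current maximum stated in the issue.

```python
from typing import Optional, Sequence

DEFAULT_MAX_DIMS = 1024  # current maximum stated in the issue


class VectorDimLimit:
    """Hypothetical holder for a dimension limit that callers may override."""

    def __init__(self, max_dims: Optional[int] = None) -> None:
        self.max_dims = DEFAULT_MAX_DIMS if max_dims is None else max_dims
        if self.max_dims <= 0:
            raise ValueError("max_dims must be positive")

    def check(self, vector: Sequence[float]) -> None:
        """Raise if the vector has more dimensions than the configured limit."""
        if len(vector) > self.max_dims:
            raise ValueError(
                f"vector has {len(vector)} dimensions; limit is {self.max_dims}"
            )
```

With this shape, `VectorDimLimit()` keeps the conservative default, while `VectorDimLimit(2048)` would accept the 1280d and 2048d vectors mentioned in the issue.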

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Mayya Sharipova
> Priority: Trivial
> Time Spent: 40m
> Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is 1024. But in practice we
> see a couple of well-known models that produce vectors with > 1024
> dimensions (e.g.,
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections to this.


