[ 
https://issues.apache.org/jira/browse/LUCENE-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099293#comment-17099293
 ] 

Adrien Grand commented on LUCENE-9148:
--------------------------------------

Not one file per field; that would be horrible. :) My current prototype has 
three files:
 - One meta file that is fully read when opening the index. It contains 
metadata about each field, such as the number of dimensions, and offsets into 
the index and data files.
 - An index file that stores the inner nodes of the BKD tree.
 - A data file that stores the leaf nodes.

The motivation for splitting the index and data files is that they have 
different access patterns. For instance, finding nearest neighbors is pretty 
intense on the index, and I believe some users might want to keep it in RAM. 
Having it in a different file from the data file will help them do so with 
MMapDirectory#setPreload and FileSwitchDirectory.
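
To illustrate the idea without depending on Lucene, here is a minimal sketch 
that uses only the JDK. It assumes that preloading boils down to calling 
MappedByteBuffer#load on the mapped file, which is my understanding of what 
MMapDirectory#setPreload does, and it picks files by extension the way 
FileSwitchDirectory routes them. The class name, the helper methods and the 
"idx" extension are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;

final class PreloadSketch {

  /** Returns true if the file's extension is in the set of "hot" extensions. */
  static boolean shouldPreload(String fileName, Set<String> hotExtensions) {
    int dot = fileName.lastIndexOf('.');
    return dot >= 0 && hotExtensions.contains(fileName.substring(dot + 1));
  }

  /** Maps a file read-only and optionally asks the OS to page it in right away. */
  static MappedByteBuffer map(Path file, boolean preload) {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      MappedByteBuffer buffer =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      if (preload) {
        buffer.load(); // best effort: the OS may still evict pages later
      }
      return buffer; // the mapping stays valid after the channel is closed
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Test helper: writes bytes to a temporary file with the given suffix. */
  static Path writeTemp(String suffix, byte[] bytes) {
    try {
      Path file = Files.createTempFile("bkd", suffix);
      file.toFile().deleteOnExit();
      return Files.write(file, bytes);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Set<String> hot = Set.of("idx"); // hypothetical extension for the index file
    Path index = writeTemp(".idx", new byte[] {1, 2, 3, 4});
    boolean preload = shouldPreload(index.getFileName().toString(), hot);
    MappedByteBuffer buffer = map(index, preload);
    System.out.println("mapped " + buffer.capacity() + " bytes, preload=" + preload);
  }
}
```

With the inner nodes in their own file, only that file needs to match the 
"hot" set; the leaf data can stay on the default, lazily paged path.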

We could go without a meta file by storing its content at the beginning of the 
index or data file, but a separate file has two benefits. It makes the write 
logic easier, since there is no need to buffer a lot of content before 
writing. It also helps get better error messages in case of corruption, since 
we can verify the content of the file against a checksum when opening the 
index, which avoids e.g. trying to create slices with out-of-bounds offsets.
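
As a rough sketch of the checksum argument, again using only the JDK: the 
meta content below is a single field entry followed by a CRC32 footer, and the 
reader verifies the footer before trusting any offset. The layout, the class 
and method names and the choice of CRC32 are assumptions made for this 
example, not the prototype's actual format.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class MetaFileSketch {

  /** Hypothetical per-field entry: dimension count plus offsets into the other files. */
  static final class FieldMeta {
    final int numDims;
    final long indexOffset;
    final long dataOffset;

    FieldMeta(int numDims, long indexOffset, long dataOffset) {
      this.numDims = numDims;
      this.indexOffset = indexOffset;
      this.dataOffset = dataOffset;
    }
  }

  private static final int PAYLOAD_BYTES = Integer.BYTES + 2 * Long.BYTES;

  /** Serializes one entry and appends a CRC32 of the payload as a footer. */
  static byte[] write(FieldMeta meta) {
    ByteBuffer buffer = ByteBuffer.allocate(PAYLOAD_BYTES + Long.BYTES);
    buffer.putInt(meta.numDims).putLong(meta.indexOffset).putLong(meta.dataOffset);
    CRC32 crc = new CRC32();
    crc.update(buffer.array(), 0, PAYLOAD_BYTES);
    buffer.putLong(crc.getValue());
    return buffer.array();
  }

  /** Verifies the footer first, so corrupt offsets are never used for slicing. */
  static FieldMeta read(byte[] bytes) {
    if (bytes.length != PAYLOAD_BYTES + Long.BYTES) {
      throw new IllegalStateException("meta file has unexpected length " + bytes.length);
    }
    CRC32 crc = new CRC32();
    crc.update(bytes, 0, PAYLOAD_BYTES);
    long stored = ByteBuffer.wrap(bytes, PAYLOAD_BYTES, Long.BYTES).getLong();
    if (crc.getValue() != stored) {
      throw new IllegalStateException("meta file checksum mismatch");
    }
    ByteBuffer payload = ByteBuffer.wrap(bytes, 0, PAYLOAD_BYTES);
    return new FieldMeta(payload.getInt(), payload.getLong(), payload.getLong());
  }

  public static void main(String[] args) {
    byte[] bytes = write(new FieldMeta(2, 128L, 4096L));
    FieldMeta meta = read(bytes);
    System.out.println(meta.numDims + " dims, index@" + meta.indexOffset
        + ", data@" + meta.dataOffset);
  }
}
```

Because the meta file is small and read in full anyway, checking it up front 
is cheap, and a flipped bit surfaces as a checksum error at open time rather 
than as an out-of-bounds slice later.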


> Move the BKD index to its own file.
> -----------------------------------
>
>                 Key: LUCENE-9148
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9148
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lucene60PointsWriter stores both inner nodes and leaf nodes in the same file, 
> interleaved. For instance if you have two fields, you would have 
> {{<leaf_nodes_A, inner_nodes_A, leaf_nodes_B, inner_nodes_B>}}. It's not 
> ideal since leaves and inner nodes have quite different access patterns. 
> Should we split this into two files? In the case when the BKD index is 
> off-heap, this would also help force it into RAM with 
> {{MMapDirectory#setPreload}}.
> Note that Lucene60PointsFormat already has a file that it calls "index" but 
> it's really only about mapping fields to file pointers in the other file and 
> not what I'm discussing here. But we could possibly store the BKD indices in 
> this existing file if we want to avoid creating a new one.



