[ 
https://issues.apache.org/jira/browse/MINDEXER-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699390#comment-17699390
 ] 

Michael Bien edited comment on MINDEXER-185 at 3/12/23 7:39 PM:
----------------------------------------------------------------

I was reading up on Lucene yesterday and had entirely forgotten that a key 
point of its data structure is that it is entirely immutable! This means 
deleting a doc won't delete anything - all it does is set a flag; the index 
itself is only updated later, during segment merges, while queries keep running.
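
To illustrate why the index size stayed the same, here is a minimal toy model 
of the idea. It is NOT Lucene's API: the class ToySegment and all of its 
methods are made up for this sketch. It only assumes that stored data is 
written once, that a delete merely clears a "live" flag, and that space is 
reclaimed when segments are merged into a new one.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only (hypothetical names, not Lucene classes).
final class ToySegment {
    private final List<String> docs; // written once, never modified
    private final BitSet live;       // the only mutable part: per-doc "live" flags

    ToySegment(List<String> docs) {
        this.docs = List.copyOf(docs);
        this.live = new BitSet(docs.size());
        this.live.set(0, docs.size()); // every doc starts out live
    }

    // "Deleting" only clears the flag; the stored doc stays in the segment.
    void delete(int docId) {
        live.clear(docId);
    }

    // Number of docs still occupying space in this segment.
    int storedCount() {
        return docs.size();
    }

    // Number of docs a query would still see.
    int liveCount() {
        return live.cardinality();
    }

    // A merge copies only live docs into a new segment; this is the point
    // where flagged docs actually disappear.
    static ToySegment merge(ToySegment a, ToySegment b) {
        List<String> kept = new ArrayList<>();
        for (ToySegment s : List.of(a, b)) {
            for (int i = s.live.nextSetBit(0); i >= 0; i = s.live.nextSetBit(i + 1)) {
                kept.add(s.docs.get(i));
            }
        }
        return new ToySegment(kept);
    }
}
```

In this model, calling delete() on every doc leaves storedCount() unchanged 
until merge() runs, which would match the observation that a filter rejecting 
everything did not shrink the extracted index.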

Feel free to close this issue. However, I believe it is worth investigating 
whether the filter could be run in the reader itself while it is building the 
index; this should hopefully have an actual effect on the resulting index size.

 

edit: maybe the filter could even be used to throw away data within the docs, 
which would be possible at that stage, I believe. E.g., I suspect that a major 
contributor to the index size is that some artifacts put their documentation 
into the description field (but this is only a guess at this point - not 
verified).
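
A rough sketch of what filtering at read time could look like, so that 
rejected docs (and unwanted fields) never reach the index in the first place. 
This is not the maven-indexer code: FilteringReader, its read() method, the 
map-based doc representation, and the "description" field name are all 
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a reader that builds an index from a record stream.
final class FilteringReader {

    // Reads records from `source`, skips those rejected by `filter`, and runs
    // each accepted record through `trimmer` before adding it to the result.
    static List<Map<String, String>> read(Iterator<Map<String, String>> source,
                                          Predicate<Map<String, String>> filter,
                                          UnaryOperator<Map<String, String>> trimmer) {
        List<Map<String, String>> index = new ArrayList<>();
        while (source.hasNext()) {
            Map<String, String> doc = source.next();
            if (filter.test(doc)) {
                index.add(trimmer.apply(doc)); // rejected docs are never written
            }
        }
        return index;
    }

    // Example trimmer: returns a copy of the doc without its (assumed)
    // "description" field, leaving the input map untouched.
    static Map<String, String> withoutDescription(Map<String, String> doc) {
        Map<String, String> copy = new HashMap<>(doc);
        copy.remove("description");
        return copy;
    }
}
```

With a filter such as doc -> false, read() would return an empty list, i.e. 
nothing is ever written, instead of writing everything and flagging it as 
deleted afterwards.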


> Document filter doesn't seem to do anything
> -------------------------------------------
>
>                 Key: MINDEXER-185
>                 URL: https://issues.apache.org/jira/browse/MINDEXER-185
>             Project: Maven Indexer
>          Issue Type: Bug
>    Affects Versions: 7.0.1
>            Reporter: Michael Bien
>            Priority: Major
>
> Hello devs!
>  
> I tried to filter the index during extraction using a DocumentFilter and it 
> didn't appear to do anything.
> As a test, I simply set {{indexUpdateRequest.setDocumentFilter(doc -> false);}} 
> before calling {{DefaultIndexUpdater.fetchAndUpdateIndex}} and the extracted 
> index had the same size (5.6 GB) as without the filter.
>  
> The filter is actually called, and it also adds a few minutes to the 
> extraction time.
> https://github.com/apache/maven-indexer/blob/1cd122b1487150613005c8f9aced9aec20fded3e/indexer-core/src/main/java/org/apache/maven/index/updater/DefaultIndexUpdater.java#L238-L241
>  
> I am not sure why the implementation is filtering the index *after* 
> extraction. Wouldn't it be easier and also more efficient to do it in 
> IndexDataReader?
> e.g. 
> https://github.com/apache/maven-indexer/blob/1cd122b1487150613005c8f9aced9aec20fded3e/indexer-core/src/main/java/org/apache/maven/index/updater/IndexDataReader.java#L269



