mbien opened a new pull request, #302:
URL: https://github.com/apache/maven-indexer/pull/302

   This is a draft meant as a talking point. It does two things:
    - uses `IndexUpdateRequest` as the configuration for `IndexDataReader` (tmp folder, factory, etc.)
    - moves the filtering from post-extraction to the read phase
      - makes filtering really fast (and multi-threaded too); it is no longer an extra step
      - actually has an effect on the on-disk index size (I've learned that Lucene doesn't really remove things, since all of its files are immutable)
    
   I do realize that this is not quite the same behavior as before. To get the exact same behavior back, we could add this as an additional filter, so there would be two: one during read (new) and one after extraction (old).
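
To illustrate the two-stage idea (one filter during read, one after extraction), here is a minimal, self-contained sketch. It does not use the maven-indexer or Lucene APIs: documents are modeled as plain `Map<String, String>` instances, and the class name, both method names, and the `groupId` field are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class TwoStageFilterSketch {

    /**
     * Read phase (new): documents rejected here are never added to the index,
     * so they cannot take up space in it.
     * The Map is a hypothetical stand-in for a Lucene Document.
     */
    public static List<Map<String, String>> readWithFilter(
            List<Map<String, String>> source, Predicate<Map<String, String>> readFilter) {
        List<Map<String, String>> index = new ArrayList<>();
        for (Map<String, String> doc : source) {
            if (readFilter.test(doc)) {
                index.add(doc);
            }
        }
        return index;
    }

    /** Post-extraction phase (old): prune an index that has already been built. */
    public static void filterAfterExtraction(
            List<Map<String, String>> index, Predicate<Map<String, String>> keep) {
        index.removeIf(keep.negate());
    }

    public static void main(String[] args) {
        // "m" holds the last-modified time in epoch millis; "groupId" is hypothetical.
        List<Map<String, String>> source = List.of(
                Map.of("groupId", "org.example", "m", "1000"),
                Map.of("groupId", "org.example", "m", "5000"),
                Map.of("groupId", "com.other", "m", "6000"));

        // Stage 1: drop old documents while reading.
        List<Map<String, String>> index =
                readWithFilter(source, doc -> Long.parseLong(doc.get("m")) > 2000L);

        // Stage 2: apply the second filter to what was actually indexed.
        filterAfterExtraction(index, doc -> "org.example".equals(doc.get("groupId")));

        System.out.println(index);
    }
}
```

In this sketch only the first stage can influence how much is ever written; the second stage merely removes entries from an existing result, which mirrors the trade-off described above.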
   
   example filter:
   ```java
         final Instant cutoff = ZonedDateTime.now().minusYears(2).toInstant();
         iur.setDocumentFilter((doc) -> {
             IndexableField field = doc.getField("m"); // usually never null
             return field != null
                     && Instant.ofEpochMilli(Long.parseLong(field.stringValue())).isAfter(cutoff);
         });
   ```
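
The cutoff check in the example filter can also be looked at in isolation. The sketch below is a stand-alone variant that works on the raw string value of the field and assumes, as the example implies, that `m` stores epoch milliseconds as a string. The class and method names are hypothetical, and rejecting null or malformed values (instead of throwing) is a choice made for this sketch, not something the PR does.

```java
import java.time.Instant;
import java.util.function.Predicate;

public class CutoffFilterSketch {

    /**
     * Builds a predicate over the string value of a timestamp field.
     * Accepts only values that parse as epoch millis strictly after the cutoff.
     */
    public static Predicate<String> modifiedAfter(Instant cutoff) {
        return value -> {
            if (value == null) {
                return false; // missing field: reject
            }
            try {
                return Instant.ofEpochMilli(Long.parseLong(value)).isAfter(cutoff);
            } catch (NumberFormatException e) {
                return false; // malformed timestamp: reject instead of failing the read
            }
        };
    }

    public static void main(String[] args) {
        Predicate<String> recent = modifiedAfter(Instant.ofEpochMilli(2000L));
        System.out.println(recent.test("5000"));
        System.out.println(recent.test("1000"));
        System.out.println(recent.test("not-a-number"));
    }
}
```

Whether a malformed value should be rejected or should abort the read is a design decision; rejecting keeps a long-running import going at the cost of silently skipping bad records.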
   results (single-threaded, since multi-threading has an index size penalty due to merge overhead):
   ```
   full: 5.6 GB
   2y: 2.6 GB
   1y: 1.4 GB
   ```
   I also tried removing some fields from within the filter (e.g. description), but this had no impact at all (then again, this is probably just me not understanding Lucene). Intuitively, 1.4 GB for one year of Maven artifacts still sounds like a bit more than it should be.
   
   
    `mvn spotless:apply` is responsible for the formatting
   
   would fix MINDEXER-185


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

