mbien opened a new pull request, #302: URL: https://github.com/apache/maven-indexer/pull/302
This is a draft meant as a talking point. It does two things:

- uses `IndexUpdateRequest` as configuration for `IndexDataReader` (tmp folder, factory, etc.)
- moves the filtering from post-extraction to the read phase
  - makes filtering really fast (and multi-threaded too); it is no longer an extra step
  - actually has an effect on the on-disk index size (I've learned that Lucene doesn't really remove things, since all files are immutable)

I do realize that this is not quite the same behavior as before. To get the exact same behavior back, we could add this as an additional filter: one during read (new) and one after extraction (old).

Example filter:

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import org.apache.lucene.index.IndexableField;

// iur is the IndexUpdateRequest; keep only documents modified within the last two years
final Instant cutoff = ZonedDateTime.now().minusYears(2).toInstant();
iur.setDocumentFilter((doc) -> {
    IndexableField field = doc.getField("m"); // usually never null
    return field != null && Instant.ofEpochMilli(Long.parseLong(field.stringValue())).isAfter(cutoff);
});
```

Results (single-threaded, since MT has an index size penalty due to merge overhead):

```
full: 5.6 GB
2y:   2.6 GB
1y:   1.4 GB
```

I also tried removing some fields from within the filter (e.g. description); this, however, had no impact at all (but again, this is probably just me not understanding Lucene). Intuitively, 1.4 GB for one year of Maven artifacts still sounds like a bit more than it should be.

`mvn spotless:apply` is responsible for the formatting.

Would fix MINDEXER-185.
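
For readers who want to try the cutoff logic without a Lucene setup, here is a minimal, self-contained sketch of the read-phase filtering idea. It is not the code in this PR: a plain `Map<String, String>` stands in for a Lucene `Document`, and the class `ReadPhaseFilterSketch`, the helpers `modifiedAfter` and `readFiltered`, and the `id` key are hypothetical names made up for illustration. The only thing taken from the description is that field `m` is parsed as epoch milliseconds.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ReadPhaseFilterSketch {

    // Assumption: a Map of field name -> string value stands in for a Lucene Document.
    // Accepts a record only if it has an "m" field (epoch millis) later than the cutoff.
    static Predicate<Map<String, String>> modifiedAfter(Instant cutoff) {
        return doc -> {
            String millis = doc.get("m");
            return millis != null
                    && Instant.ofEpochMilli(Long.parseLong(millis)).isAfter(cutoff);
        };
    }

    // Hypothetical read loop: rejected records are dropped while reading,
    // so they are never handed on to whatever would write them to the index.
    static List<Map<String, String>> readFiltered(
            List<Map<String, String>> records, Predicate<Map<String, String>> filter) {
        return records.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant cutoff = Instant.parse("2022-01-01T00:00:00Z");
        List<Map<String, String>> records = List.of(
                Map.of("id", "old", "m",
                        String.valueOf(Instant.parse("2019-06-01T00:00:00Z").toEpochMilli())),
                Map.of("id", "new", "m",
                        String.valueOf(Instant.parse("2023-06-01T00:00:00Z").toEpochMilli())),
                Map.of("id", "no-timestamp"));

        List<Map<String, String>> kept = readFiltered(records, modifiedAfter(cutoff));
        System.out.println(kept.size() + " of " + records.size() + " records kept");
        // prints "1 of 3 records kept"
    }
}
```

Because the predicate runs before anything is written, a rejected record never reaches the index in the first place, which is the property the size numbers above depend on. A second predicate applied after extraction could be chained with `Predicate.and` if the old behavior is wanted as well.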