Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of "space" in the index? Since I don't store any fields that contain text, it must be the terms extracted from the documents that occupy the space.
So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but "binary blobs". In order to verify this (and possibly omit these PDFs), I hope to get some hints from Solr/Lucene ;)
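
To make the "binary blob" suspicion concrete, here is a rough sketch of a check that could be run on the extracted text before it is sent to Solr. It does not use any Solr/Lucene or Tika API; it only assumes the extracted text is available as a plain string. The function names (`text_stats`, `looks_like_blob`) and the 0.5 threshold are made up for illustration and would need tuning on real data. The number of unique tokens is only a rough proxy for how many distinct terms a document adds to the inverted index.

```python
import string


def text_stats(text):
    """Return simple statistics about a piece of extracted text."""
    tokens = text.split()
    if not tokens:
        return {"tokens": 0, "unique_tokens": 0, "wordlike_ratio": 0.0}
    # A token counts as "word-like" if it consists only of letters
    # once surrounding punctuation is stripped.
    wordlike = sum(1 for t in tokens if t.strip(string.punctuation).isalpha())
    # Case-insensitive count of distinct tokens.
    unique = len({t.lower() for t in tokens})
    return {
        "tokens": len(tokens),
        "unique_tokens": unique,
        "wordlike_ratio": wordlike / len(tokens),
    }


def looks_like_blob(text, min_wordlike=0.5):
    """Heuristic: flag non-empty text in which few tokens look like words.

    min_wordlike is an assumed threshold, not a recommended value.
    """
    stats = text_stats(text)
    return stats["tokens"] > 0 and stats["wordlike_ratio"] < min_wordlike
```

Sorting the PDFs by `unique_tokens` (descending) or by `wordlike_ratio` (ascending) should surface the most suspicious files first, which can then be inspected by hand.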