Re: determine "big" documents in the index?

Clemens Wyss DEV Fri, 08 May 2015 07:33:14 -0700

One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is 
this "too many"?
Another field (the "single word suggestion" field) has 2'156'218 terms.




-----Original Message-----
From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] 
Sent: Friday, 8 May 2015 15:54
To: solr-user@lucene.apache.org
Subject: determine "big" documents in the index?

Context: Solr/Lucene 5.1

Is there a way to determine which documents occupy a lot of "space" in the 
index? As I don't store any fields that contain text, it must be the terms 
extracted from the documents that occupy the space. 

So my question is: which documents occupy the most space in the inverted index?

Context:
I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that 
for some PDFs the extracted text is not really text but "binary blobs". In 
order to verify this (and possibly omit these PDFs) I hope to get some hints 
from Solr/Lucene ;)
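
To make the "binary blob" suspicion concrete, here is a rough sketch of the 
kind of check that could be run on the extracted text before it is indexed. It 
is only an illustration under assumptions: the text is assumed to be available 
as a plain string, the helper names (blob_stats, looks_like_blob) and the 0.5 
threshold are made up for this example, and nothing here is a Solr, Lucene or 
Tika API.

```python
import re

# Hypothetical helpers for illustration only; not part of Solr, Lucene or Tika.
TOKEN_RE = re.compile(r"\S+")               # whitespace-separated tokens
WORD_RE = re.compile(r"^[^\W\d_]{2,25}$")   # letters only, 2 to 25 characters


def blob_stats(text):
    """Return (distinct_token_count, wordlike_ratio) for a piece of text."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0, 1.0
    # Strip common surrounding punctuation before testing for "word-likeness".
    wordlike = sum(
        1 for t in tokens if WORD_RE.match(t.strip(".,;:!?()\"'"))
    )
    return len(set(tokens)), wordlike / len(tokens)


def looks_like_blob(text, min_ratio=0.5):
    """Flag text in which fewer than min_ratio of the tokens look like words.

    The default threshold is an assumption and would need tuning on real data.
    """
    _, ratio = blob_stats(text)
    return ratio < min_ratio
```

A document with a very high distinct-token count and a low word-like ratio 
would be a candidate for closer inspection, since every distinct token becomes 
a term in the inverted index.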
