Oops, this may be a better link: http://lucidworks.com/blog/indexing-with-solrj/
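
Before sending anything to Solr from a SolrJ client, a rough per-document check of the extracted text can flag PDFs whose "text" is mostly junk. What follows is only a sketch under assumptions: the class, the method names, and both thresholds are hypothetical and come from neither Solr, SolrJ, nor Tika, and the text is assumed to have already been extracted to a String (for example by Tika).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for eyeballing extracted text; not part of Solr, SolrJ, or Tika.
class ExtractedTextCheck {

    // Assumed cutoff: tokens longer than this are treated as non-words.
    static final int MAX_WORD_LENGTH = 30;

    /** Fraction of whitespace-separated tokens that do not look like words or numbers. */
    static double suspiciousTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int suspicious = 0;
        for (String token : tokens) {
            if (looksSuspicious(token)) {
                suspicious++;
            }
        }
        return (double) suspicious / tokens.length;
    }

    /** A token is suspicious if it is very long or fewer than half its characters are letters or digits. */
    static boolean looksSuspicious(String token) {
        int length = token.codePointCount(0, token.length());
        if (length > MAX_WORD_LENGTH) {
            return true;
        }
        int lettersOrDigits = 0;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                lettersOrDigits++;
            }
            i += Character.charCount(cp);
        }
        return lettersOrDigits * 2 < length;
    }

    /** Number of distinct lowercased tokens; a crude stand-in for how many terms a document may add. */
    static int uniqueTokenCount(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        Set<String> unique = new HashSet<>();
        for (String token : trimmed.split("\\s+")) {
            unique.add(token.toLowerCase(Locale.ROOT));
        }
        return unique.size();
    }

    public static void main(String[] args) {
        String sample = "%$#@ ~~^^ hello";
        System.out.println("suspicious ratio: " + suspiciousTokenRatio(sample));
        System.out.println("unique tokens:    " + uniqueTokenCount(sample));
    }
}
```

Logging both numbers per PDF and sorting by them should make the outliers stand out. The real analysis chain will tokenize differently than a whitespace split, so treat the counts as relative, not as what ends up in the index.
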
On Fri, May 8, 2015 at 9:55 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> bq: has 30'860'099 terms. Is this "too much"
>
> Depends on how you indexed it. If you used shingles, then maybe, maybe
> not. If you just do normal text analysis, it's suspicious to say the
> least. There are about 300K words in the English language and you have
> 100X that. So either
> 1> you have a lot of legitimately unique terms, say part numbers,
> SKUs, etc., digits analyzed as text, whatever.
> 2> you have a lot of garbage in your input. OCR is notorious for this,
> as are binary blobs.
>
> The TermsComponent is your friend: it'll allow you to get an idea of
> what the actual terms are. It does take a bit of poking around, though.
>
> There's no good way I know of to know which docs are taking up space
> in the index. What I'd probably do is use Tika in a SolrJ client and
> look at the data as I sent it. Here's a place to start:
> https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <clemens...@mysign.ch> wrote:
>> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is
>> this "too much"?
>> Another field (the "single word suggestion") has 2'156'218 terms.
>>
>>
>>
>> -----Original Message-----
>> From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
>> Sent: Friday, May 8, 2015 15:54
>> To: solr-user@lucene.apache.org
>> Subject: determine "big" documents in the index?
>>
>> Context: Solr/Lucene 5.1
>>
>> Is there a way to determine documents that occupy a lot of "space" in the index?
>> As I don't store any fields that have text, it must be the terms extracted
>> from the documents occupying the space.
>>
>> So my question is: which documents occupy the most space in the inverted index?
>>
>> Context:
>> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that
>> for some PDFs the extracted text is not really text but "binary blobs". In
>> order to verify this (and possibly omit these PDFs) I hope to get some hints
>> from Solr/Lucene ;)
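
For poking around with the TermsComponent mentioned above, a request URL can be put together along these lines. This is a minimal sketch, assuming the core exposes a /terms request handler with the TermsComponent configured. The class name, core URL, and field name are made up for illustration. The terms.fl, terms.limit, and terms.sort parameters are standard TermsComponent parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that only builds a URL string; it does not talk to Solr.
class TermsQueryUrl {

    /**
     * Builds a TermsComponent request for the most frequent terms of one field.
     * coreUrl is assumed to look like http://host:port/solr/corename (no trailing slash).
     */
    static String build(String coreUrl, String field, int limit) {
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        String encodedField = URLEncoder.encode(field, StandardCharsets.UTF_8);
        return coreUrl
                + "/terms?terms=true"
                + "&terms.fl=" + encodedField   // field whose terms are listed
                + "&terms.limit=" + limit       // how many terms to return
                + "&terms.sort=count"           // most frequent terms first
                + "&wt=json";
    }

    public static void main(String[] args) {
        // Core URL and field name are placeholders, not taken from the thread.
        System.out.println(build("http://localhost:8983/solr/mycore", "phrase_suggest", 100));
    }
}
```

Switching terms.sort to index and stepping through with terms.prefix or terms.lower is one way to browse for garbage terms, since junk tends to sit among the rare terms, not the frequent ones.
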