Re: determine "big" documents in the index?

Erick Erickson Fri, 08 May 2015 09:57:58 -0700

Oops, this may be a better link: http://lucidworks.com/blog/indexing-with-solrj/


On Fri, May 8, 2015 at 9:55 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> bq: has 30'860'099 terms. Is this "too much"
>
> Depends on how you indexed it. If you used shingles, then maybe, maybe
> not. If you just do normal text analysis, it's suspicious to say the
> least. There are about 300K words in the English language and you have
> 100X that. So either
> 1> you have a lot of legitimately unique terms, say part numbers,
> SKUs, digits analyzed as text, etc.
> 2> you have a lot of garbage in your input. OCR is notorious for this,
> as are binary blobs.
>
> The TermsComponent is your friend, it'll allow you to get an idea of
> what the actual terms are, it does take a bit of poking around though.
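A minimal sketch of that kind of poking around, assuming the /terms request handler is enabled in solrconfig.xml. The base URL, core name and field name below are placeholders, not values from this thread. The helper only builds the request URL; the second function assumes the default flat JSON layout (term, count, term, count, ...) for the per-field list in the response.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=100, sort="count", prefix=None, maxcount=None):
    """Build a TermsComponent request URL for one field.

    base_url is the core URL, e.g. http://localhost:8983/solr/mycore
    (a hypothetical core name used for illustration only).
    """
    params = [
        ("terms", "true"),
        ("terms.fl", field),          # field whose terms we want to inspect
        ("terms.limit", str(limit)),  # how many terms to return
        ("terms.sort", sort),         # "count" or "index"
        ("wt", "json"),
    ]
    if prefix is not None:
        params.append(("terms.prefix", prefix))
    if maxcount is not None:
        # Upper bound on document frequency; 1 shows terms seen in one doc only.
        params.append(("terms.maxcount", str(maxcount)))
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def pair_up(flat):
    """Turn a flat [term, count, term, count, ...] list into (term, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))
```

Sorting by index with terms.maxcount=1 is one way to page through terms that occur in a single document, which is where OCR noise and blob fragments tend to show up.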
>
> There's no good way I know of to know which docs are taking up space
> in the index. What I'd probably do is use Tika in a SolrJ client and
> look at the data as I sent it, here's a place to start:
> https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <clemens...@mysign.ch> wrote:
>> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is 
>> this "too much"?
>> Another field (the "single word suggestion") has 2'156'218 terms.
>>
>>
>>
>> -----Original Message-----
>> From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
>> Sent: Friday, May 8, 2015 15:54
>> To: solr-user@lucene.apache.org
>> Subject: determine "big" documents in the index?
>>
>> Context: Solr/Lucene 5.1
>>
>> Is there a way to determine which documents occupy a lot of "space" in the 
>> index? As I don't store any fields that have text, it must be the terms 
>> extracted from the documents that occupy the space.
>>
>> So my question is: which documents occupy the most space in the inverted index?
>>
>> Context:
>> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that 
>> for some PDFs the extracted text is not really text but "binary blobs". In 
>> order to verify this (and possibly omit these PDFs) I hope to get some hints 
>> from Solr/Lucene ;)
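
One way to check for "binary blob" text before it reaches the index is to score each document's extracted text on the client side. What follows is only a rough sketch: it assumes the text Tika extracted is already available as a string, and the function names and thresholds are made up for illustration and would need tuning against real PDFs.

```python
import re

# A token counts as "word-like" if it is 2-25 letters and nothing else.
WORD_RE = re.compile(r"^[^\W\d_]{2,25}$")


def text_stats(text):
    """Collect a few cheap statistics about extracted text."""
    tokens = text.split()
    # Control characters and U+FFFD (replacement character) hint at binary data.
    odd = sum(
        1 for ch in text
        if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace())
    )
    wordlike = sum(1 for t in tokens if WORD_RE.match(t.strip(".,;:!?()\"'")))
    return {
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "odd_char_ratio": odd / len(text) if text else 0.0,
        "wordlike_ratio": wordlike / len(tokens) if tokens else 0.0,
    }


def looks_like_blob(text, max_odd=0.05, min_wordlike=0.5):
    """Heuristic flag; both thresholds are arbitrary assumptions."""
    stats = text_stats(text)
    if stats["tokens"] == 0:
        return False  # nothing extracted, so nothing to flag
    return stats["odd_char_ratio"] > max_odd or stats["wordlike_ratio"] < min_wordlike
```

Logging the unique token count per file alongside the flag also gives a rough ranking of which PDFs contribute the most terms, which is close to the original question about "big" documents.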
