> If you used shingles
I do:
    <fieldType class="solr.TextField" name="suggest_phrase"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                outputUnigrams="true"/>
      </analyzer>
    </fieldType>
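
(For context: with maxShingleSize="3" and outputUnigrams="true" every position emits unigrams, bigrams and trigrams, so a phrase like "please divide this sentence" becomes "please", "please divide", "please divide this", "divide", "divide this", "divide this sentence", "this", "this sentence", "sentence". If I read the ShingleFilter docs correctly, that alone roughly triples the term count compared to plain text analysis.)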

>http://lucidworks.com/blog/indexing-with-solrj/
This is more or less what I do

>2> you have a lot of garbage in your input. 
>OCR is notorious for this, as are binary blobs.
What does the AutoDetectParser return in the case of an OCR PDF? Can I 
"detect"/omit an OCR PDF?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, 8 May 2015 18:55
To: solr-user@lucene.apache.org
Subject: Re: determine "big" documents in the index?

bq: has 30'860'099 terms. Is this "too much"?

Depends on how you indexed it. If you used shingles, then maybe, maybe not. If 
you just do normal text analysis, it's suspicious to say the least. There are 
about 300K words in the English language and you have 100x that. So either
1> you have a lot of legitimately unique terms, say part numbers, SKUs, 
digits analyzed as text, whatever, or
2> you have a lot of garbage in your input. OCR is notorious for this,
as are binary blobs.

The TermsComponent is your friend; it'll let you get an idea of what the 
actual terms are, though it does take a bit of poking around.
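
Something along these lines works as a start (untested sketch; the /terms handler has to be registered in solrconfig.xml if it isn't already, and the core URL and field name are just examples):

    // Sketch: pull the highest-frequency terms of a field via the
    // TermsComponent using SolrJ. URL and field name are examples.
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.TermsResponse;

    public class TopTerms {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/terms");
        q.setTerms(true);
        q.addTermsField("suggest_phrase");
        q.setTermsLimit(100);            // top 100 terms by doc frequency
        QueryResponse rsp = client.query(q);
        TermsResponse terms = rsp.getTermsResponse();
        for (TermsResponse.Term t : terms.getTerms("suggest_phrase")) {
          System.out.println(t.getTerm() + " -> " + t.getFrequency());
        }
        client.close();
      }
    }

If the top terms are readable words or phrases you're probably fine; if they're long runs of seemingly random characters, that points at the garbage case.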

There's no good way I know of to see which docs are taking up space in the 
index. What I'd probably do is use Tika in a SolrJ client and look at the data 
as I send it; here's a place to start:
https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
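
Very roughly, the client-side loop might look like this (untested sketch in the spirit of that post; the URL, field names and directory handling are placeholders, not code from the blog):

    // Untested sketch: extract text with Tika on the client side, log how
    // much text each PDF produces, then index it with SolrJ.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfIndexer {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycore");
        AutoDetectParser parser = new AutoDetectParser();
        // args[0] is assumed to be a directory containing only PDFs
        for (File pdf : new File(args[0]).listFiles()) {
          BodyContentHandler handler = new BodyContentHandler(-1);
          try (InputStream in = new FileInputStream(pdf)) {
            parser.parse(in, handler, new Metadata());
          }
          String text = handler.toString();
          // The interesting part: see how much "text" each PDF really yields
          System.out.println(pdf.getName() + ": " + text.length() + " chars");
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", pdf.getName());
          doc.addField("text", text);
          client.add(doc);
        }
        client.commit();
        client.close();
      }
    }

Since you control the extraction step, that's also the place to drop or flag documents whose extracted text looks suspicious before they ever reach the index.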

Best,
Erick

On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <clemens...@mysign.ch> wrote:
> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is 
> this "too much"?
> Another field (the "single word suggestion") has 2'156'218 terms.
>
>
>
> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> Sent: Friday, 8 May 2015 15:54
> To: solr-user@lucene.apache.org
> Subject: determine "big" documents in the index?
>
> Context: Solr/Lucene 5.1
>
> Is there a way to determine the documents that occupy a lot of "space" in the 
> index? As I don't store any fields that contain text, it must be the terms 
> extracted from the documents that occupy the space.
>
> So my question is: which documents occupy the most space in the inverted index?
>
> Context:
> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect 
> that for some PDFs the extracted text is not really text but "binary 
> blobs". In order to verify this (and possibly omit these PDFs) I hope 
> to get some hints from Solr/Lucene ;)
