Try running the PDF through standalone Tika and see what comes back. That
is the real size of the input. It will usually be quite a small proportion
of the PDF size, possibly down to metadata only and no text if your PDF
does not include a text layer.
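
A minimal sketch of that check, assuming Java is installed and a standalone
tika-app jar is available locally. The jar path "tika-app.jar" and the helper
names are placeholders of mine, not something from the original question.

```python
import subprocess


def extract_text_with_tika(pdf_path: str, tika_jar: str = "tika-app.jar") -> str:
    """Run the standalone Tika app in text mode and return what it extracts.

    Assumes `java` is on PATH and `tika_jar` points at a tika-app jar
    (the default path here is a placeholder).
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--text", pdf_path],
        capture_output=True,
        check=True,
    )
    # Tika writes in the platform encoding by default; decode leniently.
    return result.stdout.decode("utf-8", errors="replace")


def text_ratio(pdf_size_bytes: int, text: str) -> float:
    """Fraction of the PDF size that survives as extracted text (UTF-8 bytes)."""
    if pdf_size_bytes <= 0:
        raise ValueError("pdf_size_bytes must be positive")
    return len(text.encode("utf-8")) / pdf_size_bytes


# Example use (needs a real PDF and the jar, so it is left commented out):
#   import os
#   text = extract_text_with_tika("sample.pdf")
#   print(text_ratio(os.path.getsize("sample.pdf"), text))
```

A ratio close to zero would suggest the PDF has little or no text layer, so
very little ends up in the index apart from metadata.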

Then it depends on your storing and indexing options, your tokenizers, and
whether you are using ngrams, synonyms, or anything else that multiplies the
content. And so on.
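
To give a feel for how ngrams multiply content, here is a rough model in
plain Python. It is only an illustration of the idea, not Solr's actual
tokenizer, and the gram sizes 2 to 4 are arbitrary values I picked.

```python
def ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Return every character ngram of the token with length min_gram..max_gram."""
    return [
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    ]


grams = ngrams("indexing", 2, 4)
# One 8-character token turns into many indexed terms.
print(len(grams), "terms,", sum(len(g) for g in grams), "characters")
```

The same kind of growth applies, to a lesser degree, when synonyms add extra
terms at the same position.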

And remember that you need several times more disk space (2 times? 3 times?)
than a single index occupies, to leave room for when Solr does segment merges.
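
As back-of-the-envelope arithmetic, with the 2x and 3x factors taken as the
unconfirmed guesses above and the 10 GiB index size invented purely for the
example:

```python
def peak_disk_bytes(index_size_bytes: int, merge_factor: int) -> int:
    """Disk to provision if merges may need `merge_factor` times the index size."""
    return index_size_bytes * merge_factor


GIB = 1024 ** 3
index_size = 10 * GIB  # hypothetical steady-state index size

for factor in (2, 3):  # the guessed multipliers, not measured values
    print(f"{factor}x headroom: {peak_disk_bytes(index_size, factor) / GIB:.0f} GiB")
```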

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jan 10, 2014 at 1:55 AM, Amit Jha <shanuu....@gmail.com> wrote:

> Hi,
>
> I would like to know: if I index a file, i.e. a PDF of 100 KB, what would
> be the size of the index? What factors should be considered to determine
> the disk size?
>
> Rgds
> AJ
