Here is an example of schema design: a 5MB PDF might contain only about
50KB of actual text. The Solr ExtractingRequestHandler will find that
text and index only that. If you set the field to stored=true, the
extracted content is saved in the index as well. If stored=false,
nothing from the PDF is saved; instead, you would store a link back to
the original file.
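
A minimal schema.xml sketch of that split might look like the following
(the field names and the text_general type are just illustrative):

  <!-- extracted PDF text: searchable, but not kept in the index -->
  <field name="text" type="text_general" indexed="true" stored="false"/>
  <!-- pointer back to the original PDF, stored so results can link to it -->
  <field name="pdf_url" type="string" indexed="false" stored="true"/>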

One problem with indexing is that Solr continually copies data into
"segments" (index parts) while you index, merging smaller segments into
larger ones. So each 5MB PDF might get copied 50 times during a full
index job. If you can strip the index down to what you really want to
search on, terabytes become gigabytes. Solr seems to handle 100GB-200GB
fine on modern hardware.
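
As a rough sketch, posting one PDF through the ExtractingRequestHandler
against a schema like the one above could look like this (the URL,
single-core layout, and field names are assumptions, not your setup):

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.pdf_url=http://example.com/docs/doc1.pdf&fmap.content=text&commit=true" \
       -F "file=@doc1.pdf"

fmap.content routes Tika's extracted body into the unstored "text"
field, so only the ~50KB of text gets indexed, and the stored pdf_url
(plus id) is what comes back in search results.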

Lance

On Fri, Dec 23, 2011 at 1:54 AM, Nick Vincent <n...@vtype.com> wrote:
> For data of this size you may want to look at something like Apache
> Cassandra, which is made specifically to handle data at this kind of
> scale across many machines.
>
> You can still use Hadoop to analyse and transform the data in a
> performant manner, however it's probably best to do some research on
> this on the relevant technical forums for those technologies.
>
> Nick



-- 
Lance Norskog
goks...@gmail.com
