Here is an example of schema design: a 5MB PDF file might have maybe 50KB of actual text. The Solr ExtractingRequestHandler will extract that text and index only that. If you set the field to stored=true, the 5MB gets saved in the index. If stored=false, the PDF content is not saved; instead, you would store a link to it.
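A minimal sketch of what that schema.xml could look like (the field names and the text_general type are placeholders, not anything from your setup):

  <!-- extracted text from the ExtractingRequestHandler: searchable, but not stored -->
  <field name="content" type="text_general" indexed="true" stored="false"/>
  <!-- small stored field pointing back to the original PDF (filesystem path or URL) -->
  <field name="pdf_url" type="string" indexed="false" stored="true"/>

Search hits then return pdf_url, and your application fetches the original document itself.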
One problem with indexing is that Solr continually copies data into "segments" (index parts) while you index. So each 5MB PDF might get copied 50 times during a full index job. If you can strip the index down to what you really want to search on, terabytes become gigabytes. Solr seems to handle 100GB-200GB fine on modern hardware.

Lance

On Fri, Dec 23, 2011 at 1:54 AM, Nick Vincent <n...@vtype.com> wrote:
> For data of this size you may want to look at something like Apache
> Cassandra, which is made specifically to handle data at this kind of
> scale across many machines.
>
> You can still use Hadoop to analyse and transform the data in a
> performant manner, however it's probably best to do some research on
> this on the relevant technical forums for those technologies.
>
> Nick

--
Lance Norskog
goks...@gmail.com