On Mon, 2014-04-07 at 13:52 +0200, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http?

Some of us do. Our net archive indexer runs a lot of Tika processes that
send their analysed documents through http. We're building 1TB indexes
of about 300-400M documents each. The Tika analysis is by far the heavy
part of the setup: 1 Solr instance easily keeps up with 30 Tikas on a 24
core machine (or 48, depending on how you count). This setup makes it
easy to scale up & out, basically by starting new Tika processes on
whatever machines we have available.
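
As a rough illustration only (not the actual net archive indexer), a
worker that has finished analysing a batch could post it to Solr's JSON
update handler along these lines. The host, port, core name
"netarchive", the field names and the commitWithin value are all
assumptions made up for the sketch.

```python
import json
import urllib.request

# Assumed endpoint: host, port and core name are placeholders.
SOLR_UPDATE_URL = "http://localhost:8983/solr/netarchive/update"


def build_update_request(docs, url=SOLR_UPDATE_URL, commit_within_ms=60000):
    """Build an HTTP POST that adds a batch of documents to Solr.

    docs is a list of dicts, one per analysed document.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{url}?commitWithin={commit_within_ms}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_batch(docs, url=SOLR_UPDATE_URL):
    """Send one batch; every analysis worker calls this independently."""
    request = build_update_request(docs, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Since each worker only needs the URL of the Solr instance, adding
capacity amounts to starting more workers on other machines.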

In other setups, where the pre-index analysis is lighter, the choice of
transport layer might matter more. As always, optimize where it is
needed.

- Toke Eskildsen, State and University Library, Denmark