On Mon, 2014-04-07 at 13:52 +0200, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http?

Some of us do. Our net archive indexer runs a lot of Tika processes that send their analysed documents through http. We're building 1TB indexes of about 300-400M documents each.

The Tika analysis is by far the heavy part of the setup: 1 Solr instance easily keeps up with 30 Tikas on a 24-core machine (or 48, depending on how you count). This setup makes it easy to scale up & out, basically by starting new Tika processes on whatever machines we have available.

In other setups, where the pre-index analysis is lighter, the choice of transport layer might matter more. As always, optimize where it is needed.

- Toke Eskildsen, State and University Library, Denmark
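
For illustration only, here is a minimal sketch of how one analysis worker might hand its documents to Solr over http. It is not the code from the setup described above. The update URL, the core name "archive", the field names, the batch size and the commitWithin value are all assumptions made for the example. It relies only on Solr's JSON update handler accepting a JSON array of documents via POST.

```python
import json
import urllib.request
from itertools import islice


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_update_request(update_url, docs, commit_within_ms=60000):
    """Build (but do not send) an HTTP POST carrying one batch as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{update_url}?commitWithin={commit_within_ms}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(update_url, docs, batch_size=100):
    """POST the documents batch by batch; return the number of batches sent."""
    sent = 0
    for chunk in batches(docs, batch_size):
        with urllib.request.urlopen(build_update_request(update_url, chunk)) as resp:
            resp.read()  # drain the response so the connection is released
        sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical output of an analysis step; the field names are made up.
    analysed = [{"id": f"doc-{i}", "content": "extracted text"} for i in range(250)]
    # Assumed local Solr with a core named "archive".
    print(send("http://localhost:8983/solr/archive/update", analysed))
```

Each worker only needs the update URL, so adding capacity in a layout like this amounts to starting more worker processes that point at the same Solr instance.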