Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
> Cloud - roughly 500-600 million docs per day indexing each of the fields
> (about 180)?

Possible, yes. Reasonable? It is not going to be cheap.

Twitter indexes the tweets itself and has been quite open about how it is
done. I would suggest looking for their presentations, either slides or
recordings. They have presented at Berlin Buzzwords and Lucene/Solr Revolution
and probably elsewhere too. The gist is that they have done a lot of work and
custom coding to handle it.

> If I were to guess at a sharded setup to handle such data, and keep 2 years
> worth, I would guess about 2500 shards.  Is that reasonable?
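
To put numbers on that guess, here is a rough back-of-the-envelope
calculation. It assumes 365-day years and an even spread of documents across
shards; the function name is made up for illustration.

```python
def docs_per_shard(docs_per_day: int, days: int, shards: int) -> int:
    """Average number of documents per shard, assuming an even spread."""
    return (docs_per_day * days) // shards

DAYS = 2 * 365  # two years of retention, ignoring leap days

low = docs_per_shard(500_000_000, DAYS, 2500)
high = docs_per_shard(600_000_000, DAYS, 2500)
total_low = 500_000_000 * DAYS
total_high = 600_000_000 * DAYS

print(f"total docs: {total_low:,} to {total_high:,}")
print(f"docs per shard: {low:,} to {high:,}")
```

That is roughly 365 to 438 billion documents in total, or about 146 to 175
million documents per shard.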

I think you need to think well beyond standard SolrCloud setups. Even if you
manage to get 2500 shards running, you will want to do a lot of tweaking of how
queries are issued, so that each request does not require all 2500 shards to
be searched. Prioritizing newer material and only querying the older shards if
there are not enough recent results is one example.
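
A minimal sketch of that idea, in Python. Nothing here is Solr API: the tiers
are just labels for groups of shards ordered newest first, and search_tier is
a hypothetical callable that you would implement yourself to run the query
against one group.

```python
from typing import Callable, List, Sequence

def tiered_search(
    tiers: Sequence[str],
    search_tier: Callable[[str, int], List[str]],
    wanted: int,
) -> List[str]:
    """Query tiers newest first and stop once enough results are collected.

    search_tier(tier, rows) is assumed to return at most `rows` hits
    from the shards belonging to that tier.
    """
    results: List[str] = []
    for tier in tiers:
        if len(results) >= wanted:
            break  # older tiers are never touched
        results.extend(search_tier(tier, wanted - len(results)))
    return results[:wanted]

# Usage with a fake in-memory backend standing in for the real shards.
fake_hits = {"recent": ["t9", "t8"], "last_year": ["t5", "t4", "t3"], "old": ["t1"]}
queried: List[str] = []

def fake_search(tier: str, rows: int) -> List[str]:
    queried.append(tier)
    return fake_hits[tier][:rows]

hits = tiered_search(["recent", "last_year", "old"], fake_search, wanted=4)
```

Note that this favors recency over relevance: an older document with a better
score is never seen if the newer tiers already fill the result list.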

I highly doubt that a single SolrCloud is the best answer here. Maybe one cloud 
for each month and a lot of external logic?
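
As a rough illustration of what a small piece of that external logic could
look like, the Python sketch below maps a date range to the monthly clouds it
touches, newest first. The "tweets_YYYY_MM" naming scheme is purely an
assumption for the example.

```python
from datetime import date
from typing import List

def monthly_collections(start: date, end: date, prefix: str = "tweets") -> List[str]:
    """Names of the monthly clouds covering start..end, newest first.

    The naming scheme is hypothetical; an empty list is returned
    when start is later than end.
    """
    names: List[str] = []
    year, month = end.year, end.month
    while (year, month) >= (start.year, start.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return names

targets = monthly_collections(date(2014, 11, 15), date(2015, 2, 3))
```

A query restricted to a date range would then only be sent to the clouds in
that list instead of to everything.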

- Toke Eskildsen
