As always, the first question has to be how you wish to query the data: the queries will drive the data model. I don't want to put words in your mouth as to your query requirements, so... clue us in on what they are.
-- Jack Krupansky

On Thu, Mar 3, 2016 at 2:25 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
> > Cloud - roughly 500-600 million docs per day indexing each of the fields
> > (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter index the tweets themselves and have been quite open about how
> they do it. I would suggest looking for their presentations; slides or
> recordings. They have presented at Berlin Buzzwords and Lucene/Solr
> Revolution and probably elsewhere too. The gist is that they have done a
> lot of work and custom coding to handle it.
>
> > If I were to guess at a sharded setup to handle such data, and keep 2 years
> > worth, I would guess about 2500 shards. Is that reasonable?
>
> I think you need to think well beyond standard SolrCloud setups. Even if
> you manage to get 2500 shards running, you will want to do a lot of
> tweaking on the way to issue queries so that each request does not require
> all 2500 shards to be searched. Prioritizing newer material and only querying
> the older shards if there are not enough recent results is an example.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe one
> cloud for each month and a lot of external logic?
>
> - Toke Eskildsen
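
To make the numbers and the "newest first" idea in the quoted message concrete, here is a rough Python sketch. It is an illustration only, not how Twitter or anyone else does it. The collection naming scheme (`tweets_YYYY_MM`), the `search` callable, and the 550 million docs/day midpoint are all assumptions of mine; `search` stands in for whatever client code would actually query one monthly collection.

```python
from datetime import date
from typing import Callable, List


def docs_per_shard(docs_per_day: int, days: int, shards: int) -> int:
    """Rough sizing: total documents retained, spread evenly over the shards."""
    return (docs_per_day * days) // shards


def monthly_collections(newest: date, months: int, prefix: str = "tweets_") -> List[str]:
    """Build hypothetical per-month collection names, newest month first."""
    names = []
    year, month = newest.year, newest.month
    for _ in range(months):
        names.append(f"{prefix}{year:04d}_{month:02d}")
        month -= 1
        if month == 0:  # roll over into December of the previous year
            year, month = year - 1, 12
    return names


def tiered_search(
    query: str,
    collections: List[str],
    search: Callable[[str, str, int], List[str]],
    wanted: int,
) -> List[str]:
    """Query collections in the given order and stop once `wanted` hits are found.

    `search(collection, query, limit)` is a placeholder (assumption) for the
    real per-collection query; older collections are only touched when the
    newer ones did not return enough results.
    """
    hits: List[str] = []
    for name in collections:
        hits.extend(search(name, query, wanted - len(hits)))
        if len(hits) >= wanted:
            break
    return hits[:wanted]


if __name__ == "__main__":
    # 550 million docs/day (assumed midpoint of 500-600), 2 years, 2500 shards
    print(docs_per_shard(550_000_000, 730, 2500))
    print(monthly_collections(date(2016, 3, 3), 3))
```

With those assumed inputs, two years of data comes to roughly 400 billion documents, or about 160 million per shard across 2500 shards. The tiered loop is the "external logic" part: a request for the 10 most recent matches would usually be answered by the newest month alone, leaving the other collections idle.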