As always, the initial question is how you intend to query the data - query drives data modeling. How real-time do you need queries to be? How fast do you need archive queries to be? How many fields do you need to query on? How much entity recognition do you need in queries?
-- Jack Krupansky

On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk> wrote:

> On 03/03/2016 19:25, Toke Eskildsen wrote:
>
>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>>> Hi All - would it be reasonable to index the Twitter 'firehose'
>>> with Solr Cloud - roughly 500-600 million docs per day indexing
>>> each of the fields (about 180)?
>>>
>>
>> Possible, yes. Reasonable? It is not going to be cheap.
>>
>> Twitter index the tweets themselves and have been quite open about
>> how they do it. I would suggest looking for their presentations;
>> slides or recordings. They have presented at Berlin Buzzwords and
>> Lucene/Solr Revolution and probably elsewhere too. The gist is that
>> they have done a lot of work and custom coding to handle it.
>>
>
> As I recall they're not using Solr, but rather an in-house layer built on
> a customised version of Lucene. They're indexing around half a trillion
> tweets.
>
> If the idea is to provide a searchable archive of all tweets, my first
> question would be 'why': if the idea is to monitor new tweets for
> particular patterns there are better ways to do this (Luwak for example).
>
> Charlie
>
>
>>> If I were to guess at a sharded setup to handle such data, and keep
>>> 2 years' worth, I would guess about 2500 shards. Is that
>>> reasonable?
>>>
>>
>> I think you need to think well beyond standard SolrCloud setups. Even
>> if you manage to get 2500 shards running, you will want to do a lot
>> of tweaking on the way to issue queries so that each request does not
>> require all 2500 shards to be searched. Prioritizing newer material
>> and only querying the older shards if there are not enough recent
>> results is an example.
>>
>> I highly doubt that a single SolrCloud is the best answer here. Maybe
>> one cloud for each month and a lot of external logic?
>>
>> - Toke Eskildsen
>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
> web: www.flax.co.uk
>
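
To put rough numbers on the figures quoted above, and to illustrate the "newer material first" idea from Toke's reply, here is a minimal Python sketch. It is not from the thread and is not a Solr API: the 550 million docs/day midpoint, the 730-day retention, the "YYYY-MM" month keys, and the `search_month` callback (a stand-in for whatever queries one monthly collection) are all assumptions made for illustration.

```python
from datetime import date
from typing import Callable, List

# Figures from the thread; the midpoint and 365-day year are assumptions.
DOCS_PER_DAY = 550_000_000   # midpoint of "500-600 million docs per day"
RETENTION_DAYS = 2 * 365     # "2 years' worth"
SHARDS = 2500                # the shard count guessed in the question


def docs_per_shard(docs_per_day: int = DOCS_PER_DAY,
                   days: int = RETENTION_DAYS,
                   shards: int = SHARDS) -> int:
    """Average number of documents each shard would hold."""
    return docs_per_day * days // shards


def month_keys_newest_first(today: date, months: int) -> List[str]:
    """Hypothetical per-month collection keys, newest first ("YYYY-MM")."""
    keys = []
    year, month = today.year, today.month
    for _ in range(months):
        keys.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys


def search_newest_first(query: str,
                        month_keys: List[str],
                        search_month: Callable[[str, str, int], List[str]],
                        wanted: int) -> List[str]:
    """Query the newest month first; touch older months only if the
    newer ones did not yield enough results."""
    results: List[str] = []
    for key in month_keys:
        # search_month(key, query, limit) is a hypothetical callback that
        # returns at most `limit` hits from one monthly collection.
        results.extend(search_month(key, query, wanted - len(results)))
        if len(results) >= wanted:
            break
    return results[:wanted]


if __name__ == "__main__":
    print(docs_per_shard())
    print(month_keys_newest_first(date(2016, 3, 4), 3))
```

Under these assumptions each of the 2500 shards would average on the order of 160 million documents, and a query that is satisfied by the most recent month never touches the older collections. Result ordering here is simply newest month first; merging by relevance across months would need additional external logic.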