On 03/03/2016 19:25, Toke Eskildsen wrote:
Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
Hi All - would it be reasonable to index the Twitter 'firehose'
with Solr Cloud - roughly 500-600 million docs per day indexing
each of the fields (about 180)?
Possible, yes. Reasonable? It is not going to be cheap.
Twitter index the tweets themselves and have been quite open about
how they do it. I would suggest looking for their presentations;
slides or recordings. They have presented at Berlin Buzzwords and
Lucene/Solr Revolution and probably elsewhere too. The gist is that
they have done a lot of work and custom coding to handle it.
As I recall they're not using Solr, but rather an in-house layer built
on a customised version of Lucene. They're indexing around half a
trillion tweets.
If the idea is to provide a searchable archive of all tweets, my first
question would be 'why': if the idea is to monitor new tweets for
particular patterns there are better ways to do this (Luwak for example).
Charlie
If I were to guess at a sharded setup to handle such data, and keep
2 years' worth, I would guess about 2500 shards. Is that
reasonable?
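
For a sense of scale, the figures in the question can be turned into a rough per-shard document count. This is only back-of-the-envelope arithmetic: it assumes documents are spread evenly across shards and ignores replicas, deletions and per-document size. The function name is made up for illustration.

```python
# Rough sizing from the numbers in the question:
# 500-600 million docs/day, 2 years retained, 2500 shards.
DAYS_RETAINED = 2 * 365
SHARD_COUNT = 2500


def docs_per_shard(docs_per_day, days=DAYS_RETAINED, shards=SHARD_COUNT):
    """Average documents per shard, assuming an even spread (an assumption)."""
    return docs_per_day * days // shards


low = docs_per_shard(500_000_000)   # 146,000,000 docs per shard
high = docs_per_shard(600_000_000)  # 175,200,000 docs per shard
total_low = 500_000_000 * DAYS_RETAINED   # 365 billion docs overall
total_high = 600_000_000 * DAYS_RETAINED  # 438 billion docs overall

print(f"{low:,} to {high:,} docs per shard")
print(f"{total_low:,} to {total_high:,} docs in total")
```

So 2500 shards would mean very roughly 150-175 million documents per shard, each with about 180 indexed fields.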
I think you need to think well beyond standard SolrCloud setups. Even
if you manage to get 2500 shards running, you will want to do a lot
of tweaking to the way queries are issued, so that each request does
not require all 2500 shards to be searched. Prioritizing newer material
and only querying the older shards if there are not enough recent
results is an example.
I highly doubt that a single SolrCloud is the best answer here. Maybe
one cloud for each month and a lot of external logic?
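
A minimal sketch of that kind of external logic, assuming one collection per month and a caller-supplied search function. The collection names and the search_one callable are hypothetical; nothing here is a Solr API, and a real setup would also have to deal with sorting, paging and partial failures.

```python
def tiered_search(query, collections_newest_first, search_one, wanted=10):
    """Query the newest collection first and fall back to older ones
    only while fewer than `wanted` hits have been collected.

    search_one(collection, query, rows) is a hypothetical callable that
    returns at most `rows` hits from a single collection.
    """
    hits = []
    searched = []
    for name in collections_newest_first:
        remaining = wanted - len(hits)
        if remaining <= 0:
            break  # enough recent results, older collections are skipped
        hits.extend(search_one(name, query, remaining))
        searched.append(name)
    return hits[:wanted], searched
```

With this approach a query that is satisfied by recent material touches only the newest month, and the older collections are hit only when the recent ones come up short.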
- Toke Eskildsen
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk