Re: Indexing Twitter - Hypothetical

Jack Krupansky Thu, 03 Mar 2016 12:52:36 -0800

As always, the first question has to be how you wish to query the data -
the queries will drive the data model. I don't want to put words in your
mouth as to your query requirements, so... clue us in on what they are.


-- Jack Krupansky

On Thu, Mar 3, 2016 at 2:25 PM, Toke Eskildsen <t...@statsbiblioteket.dk>
wrote:

> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
> > Cloud - roughly 500-600 million docs per day indexing each of the fields
> > (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter index the tweets themselves and have been quite open about how
> they do it. I would suggest looking for their presentations; slides or
> recordings. They have presented at Berlin Buzzwords and Lucene/Solr
> Revolution and probably elsewhere too. The gist is that they have done a
> lot of work and custom coding to handle it.
>
> > If I were to guess at a sharded setup to handle such data, and keep
> > 2 years worth, I would guess about 2500 shards.  Is that reasonable?
>
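
For scale, the figures quoted above can be turned into a rough per-shard
document count. This is only back-of-the-envelope arithmetic on the numbers
in the question (500-600 million docs per day, 2 years, 2500 shards); the
function name is made up for illustration, and a 365-day year and an even
spread of documents across shards are assumptions.

```python
# Rough sizing from the figures in the question. Assumes 365-day years and
# documents spread evenly across shards; neither is stated in the thread.
RETENTION_DAYS = 2 * 365
SHARDS = 2500


def docs_per_shard(docs_per_day, retention_days=RETENTION_DAYS, shards=SHARDS):
    """Return (total docs retained, docs per shard) for a given daily rate."""
    total = docs_per_day * retention_days
    return total, total // shards


if __name__ == "__main__":
    for rate in (500_000_000, 600_000_000):
        total, per_shard = docs_per_shard(rate)
        print(f"{rate:,} docs/day -> {total:,} docs total, "
              f"{per_shard:,} docs per shard")
```

At those rates the total comes to roughly 365 to 438 billion documents, or
about 146 to 175 million documents per shard.
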
> I think you need to think well beyond standard SolrCloud setups. Even if
> you manage to get 2500 shards running, you will want to do a lot of
> tweaking of the way queries are issued, so that each request does not
> require all 2500 shards to be searched. Prioritizing newer material and
> only querying the older shards if there are not enough recent results is
> an example.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe one
> cloud for each month and a lot of external logic?
>
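
The "newest first, older only if needed" idea, combined with one cloud per
month, could be driven by external logic along these lines. This is a
minimal sketch, not anything from Twitter or Solr: the partition names and
the search_partition callback are hypothetical stand-ins for whatever
actually queries one monthly cloud, and merging by relevance across months
is ignored.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical callback: (partition name, query, max results) -> list of hits.
SearchFn = Callable[[str, str, int], List[str]]


def search_newest_first(
    partitions: Sequence[str],
    search_partition: SearchFn,
    query: str,
    wanted: int,
) -> Tuple[List[str], List[str]]:
    """Query partitions in the given order (newest first) until enough hits.

    Returns the collected hits and the names of the partitions that were
    actually searched, so older partitions are only touched when needed.
    """
    hits: List[str] = []
    searched: List[str] = []
    for name in partitions:
        remaining = wanted - len(hits)
        if remaining <= 0:
            break  # enough recent results; skip all older partitions
        hits.extend(search_partition(name, query, remaining))
        searched.append(name)
    return hits[:wanted], searched


if __name__ == "__main__":
    # In-memory stand-in for three monthly clouds, newest first.
    fake_index = {
        "tweets_2016_03": ["a", "b"],
        "tweets_2016_02": ["c", "d", "e"],
        "tweets_2016_01": ["f"],
    }

    def fake_search(name: str, query: str, limit: int) -> List[str]:
        return fake_index[name][:limit]

    print(search_newest_first(list(fake_index), fake_search, "solr", 4))
```

With the stand-in data, asking for 4 results touches only the two newest
partitions; the oldest one is never queried.
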
> - Toke Eskildsen
>
