You have my permission... and blessing... and... condolences! BTW, our
usual recommendation is to do a subset proof of concept to see how all the
pieces come together, and then calculate the scaling from there. IOW, go
ahead and index a day, a week, or a month from the firehose, see how many
nodes, how much RAM, and how much SSD that takes, and scale from there.
Bear in mind that extrapolating by more than a factor of ten is
problematic given nonlinear effects.
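
To make that concrete, here is a back-of-the-envelope version of that
extrapolation. Every input (docs/day, index size, shard count) is a
made-up placeholder; substitute whatever your PoC actually measures.

# Back-of-the-envelope scaling from a one-day proof of concept.
# All inputs are illustrative placeholders, not measurements.
poc_days       = 1
poc_docs       = 550000000   # ~550M tweets/day from the firehose
poc_index_tb   = 1.5         # observed on-disk index size (hypothetical)
poc_shards     = 4           # shards used in the PoC (hypothetical)

retention_days = 2 * 365     # keep two years
factor         = retention_days / float(poc_days)

print("docs:   %d" % (poc_docs * factor))
print("index:  %.0f TB" % (poc_index_tb * factor))
print("shards: %.0f" % (poc_shards * factor))

# Caveat as above: extrapolating more than ~10x past what you actually
# measured is unreliable (merge cost, GC, and query fan-out all scale
# nonlinearly), so treat these as floor estimates, not a capacity plan.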
-- Jack Krupansky

On Tue, Mar 8, 2016 at 11:50 AM, Joseph Obernberger
<joseph.obernber...@gmail.com> wrote:

> Thank you for the links and explanation. We are using GATE (General
> Architecture for Text Engineering) and parts of the Stanford NER/Parser
> for the data that we ingest, but we do not apply it to the queries - only
> the data. We have been concentrating on the back-end and the analytics,
> not so much on what comes in as queries; that is something we need to
> address. For this hypothetical, I wanted to get ideas on what questions
> would need to be asked, and how large the system would need to be. Thank
> you all very much for the information so far!
>
> Jack - I want to be a guru-level Solr expert. :)
>
> -Joe
>
> On Sun, Mar 6, 2016 at 1:29 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > This is a very good presentation on using entity extraction in query
> > understanding. As you'll see from the preso, it is not easy.
> >
> > http://www.slideshare.net/dtunkelang/better-search-through-query-understanding
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/ (my blog)
> >
> > > On Mar 6, 2016, at 7:27 AM, Jack Krupansky <jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > Back to the original question... there are two answers:
> > >
> > > 1. Yes - for guru-level Solr experts.
> > > 2. No - for anybody else.
> > >
> > > For starters (as always), you would need to do a lot more upfront
> > > work on mapping out the forms of query which will be supported. For
> > > example: Is your focus on precision or recall? Are you looking to
> > > analyze all matching tweets or just a sample? What are the load,
> > > throughput, and latency requirements? Are there any spatial search
> > > requirements? Any entity search requirements? Without a clear view
> > > of the query requirements it simply isn't possible to even begin
> > > defining a data model. And without a data model, indexing is a
> > > fool's errand. In short: no focus, no progress.
> > >
> > > -- Jack Krupansky
> > >
> > > On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com>
> > > wrote:
> > >
> > >> Entity Recognition means you may want to recognize different
> > >> entities - name/person, email, location/city/state/country, etc. -
> > >> in your tweets/messages, with the goal of providing more relevant
> > >> results to users. NER can be used at query time or at indexing
> > >> (data enrichment) time.
> > >>
> > >> Thanks,
> > >> Susheel
> > >>
> > >> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger
> > >> <joseph.obernber...@gmail.com> wrote:
> > >>
> > >>> Thank you all very much for all the responses so far. I've enjoyed
> > >>> reading them! We have noticed that storing data inside of Solr
> > >>> results in significantly worse performance (particularly for
> > >>> faceting), so we store the values of all the fields elsewhere but
> > >>> index all the data with Solr Cloud. I think the suggestion about
> > >>> splitting the data up into blocks of date/time is where we would
> > >>> be headed: either two Solr Cloud clusters - one to handle ~30 days
> > >>> of data and one to handle historical data - or a single Solr Cloud
> > >>> cluster with multiple cores/collections. Either way you'd need a
> > >>> job to come through and clean up old data.
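> > >>>
> > >>> A rough sketch of what such a cleanup/rotation job could look like
> > >>> against the Collections API (the daily naming scheme, shard count,
> > >>> and 30-day window are made up for illustration):
> > >>>
> > >>> import requests
> > >>> from datetime import datetime, timedelta
> > >>>
> > >>> SOLR = "http://localhost:8983/solr/admin/collections"
> > >>> RETENTION_DAYS = 30  # hypothetical hot-tier window
> > >>>
> > >>> def rotate(today=None):
> > >>>     today = today or datetime.utcnow()
> > >>>     name = lambda d: "tweets_" + d.strftime("%Y_%m_%d")
> > >>>
> > >>>     # Create today's collection (a "tweets" config set is assumed
> > >>>     # to already be in ZooKeeper).
> > >>>     requests.get(SOLR, params={
> > >>>         "action": "CREATE", "name": name(today),
> > >>>         "numShards": 8, "collection.configName": "tweets"})
> > >>>
> > >>>     # Atomically repoint a "hot" alias at the newest 30 days.
> > >>>     days = [today - timedelta(days=i) for i in range(RETENTION_DAYS)]
> > >>>     requests.get(SOLR, params={
> > >>>         "action": "CREATEALIAS", "name": "tweets_hot",
> > >>>         "collections": ",".join(name(d) for d in days)})
> > >>>
> > >>>     # Drop (or hand off to the historical tier) the expired day.
> > >>>     requests.get(SOLR, params={
> > >>>         "action": "DELETE",
> > >>>         "name": name(today - timedelta(days=RETENTION_DAYS))})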
> > >>> The historical cluster would have much worse performance,
> > >>> particularly for clustering and faceting the data, but that may be
> > >>> acceptable.
> > >>>
> > >>> I don't know what you mean by 'entity recognition in the queries'
> > >>> - could you elaborate?
> > >>>
> > >>> We would want to index and potentially facet on any of the fields
> > >>> - for example entities_media_url, username, even background color
> > >>> - but we do not know a priori which fields will be important to
> > >>> users. As to why we would want to make the data searchable: well,
> > >>> I don't make the rules! Tweets are not the only data source, but
> > >>> they are certainly the largest we are currently looking at
> > >>> handling.
> > >>>
> > >>> I will read up on the Berlin Buzzwords talks - thank you for the
> > >>> info!
> > >>>
> > >>> -Joe
> > >>>
> > >>> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky
> > >>> <jack.krupan...@gmail.com> wrote:
> > >>>
> > >>>> As always, the initial question is how you intend to query the
> > >>>> data - query drives data modeling. How real-time do you need
> > >>>> queries to be? How fast do you need archive queries to be? How
> > >>>> many fields do you need to query on? How much entity recognition
> > >>>> do you need in queries?
> > >>>>
> > >>>> -- Jack Krupansky
> > >>>>
> > >>>> On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk>
> > >>>> wrote:
> > >>>>
> > >>>>> On 03/03/2016 19:25, Toke Eskildsen wrote:
> > >>>>>
> > >>>>>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi All - would it be reasonable to index the Twitter
> > >>>>>>> 'firehose' with Solr Cloud - roughly 500-600 million docs per
> > >>>>>>> day, indexing each of the fields (about 180)?
> > >>>>>>
> > >>>>>> Possible, yes. Reasonable? It is not going to be cheap.
> > >>>>>>
> > >>>>>> Twitter index the tweets themselves and have been quite open
> > >>>>>> about how they do it. I would suggest looking for their
> > >>>>>> presentations, slides, or recordings. They have presented at
> > >>>>>> Berlin Buzzwords and Lucene/Solr Revolution, and probably
> > >>>>>> elsewhere too. The gist is that they have done a lot of work
> > >>>>>> and custom coding to handle it.
> > >>>>>
> > >>>>> As I recall they're not using Solr, but rather an in-house layer
> > >>>>> built on a customised version of Lucene. They're indexing around
> > >>>>> half a trillion tweets.
> > >>>>>
> > >>>>> If the idea is to provide a searchable archive of all tweets, my
> > >>>>> first question would be 'why': if the idea is to monitor new
> > >>>>> tweets for particular patterns, there are better ways to do this
> > >>>>> (Luwak, for example).
> > >>>>>
> > >>>>> Charlie
> > >>>>>
> > >>>>>>> If I were to guess at a sharded setup to handle such data, and
> > >>>>>>> keep 2 years' worth, I would guess about 2500 shards. Is that
> > >>>>>>> reasonable?
> > >>>>>>
> > >>>>>> I think you need to think well beyond standard SolrCloud
> > >>>>>> setups. Even if you manage to get 2500 shards running, you will
> > >>>>>> want to do a lot of tweaking in the way you issue queries so
> > >>>>>> that each request does not have to search all 2500 shards.
> > >>>>>> Prioritizing newer material, and only querying the older shards
> > >>>>>> if there are not enough recent results, is one example.
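> > >>>>>>
> > >>>>>> A sketch of that kind of tiered routing (the alias names and
> > >>>>>> the result threshold are invented for illustration):
> > >>>>>>
> > >>>>>> import requests
> > >>>>>>
> > >>>>>> # Newest tier first; each entry is a collection or alias.
> > >>>>>> TIERS = ["tweets_hot", "tweets_2015", "tweets_2014"]
> > >>>>>> WANTED = 100  # stop fanning out once we have this many hits
> > >>>>>>
> > >>>>>> def tiered_search(q):
> > >>>>>>     hits = []
> > >>>>>>     for tier in TIERS:
> > >>>>>>         # Only touch an older tier when the newer ones did not
> > >>>>>>         # return enough results.
> > >>>>>>         resp = requests.get(
> > >>>>>>             "http://localhost:8983/solr/%s/select" % tier,
> > >>>>>>             params={"q": q, "rows": WANTED - len(hits),
> > >>>>>>                     "wt": "json"}).json()
> > >>>>>>         hits.extend(resp["response"]["docs"])
> > >>>>>>         if len(hits) >= WANTED:
> > >>>>>>             break
> > >>>>>>     return hits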
> > >>>>>>
> > >>>>>> I highly doubt that a single SolrCloud is the best answer here.
> > >>>>>> Maybe one cloud for each month and a lot of external logic?
> > >>>>>>
> > >>>>>> - Toke Eskildsen
> > >>>>>
> > >>>>> --
> > >>>>> Charlie Hull
> > >>>>> Flax - Open Source Enterprise Search
> > >>>>>
> > >>>>> tel/fax: +44 (0)8700 118334
> > >>>>> mobile: +44 (0)7767 825828
> > >>>>> web: www.flax.co.uk