You have my permission... and blessing... and... condolences! BTW, our
usual recommendation is to do a subset proof of concept to see how all the
pieces come together, and then calculate the scaling from there. IOW, go
ahead and index a day, a week, or a month from the firehose, see how many
nodes, how much RAM, and how much SSD that takes, and scale from there.
Bear in mind that extrapolating by more than a factor of ten is
problematic given nonlinear effects.
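
To make that concrete, here is a back-of-the-envelope version of that
extrapolation. Every input (docs/day, index size, shard count) is a
made-up placeholder; substitute whatever your PoC actually measures.

# Back-of-the-envelope scaling from a one-day proof of concept.
# All inputs are illustrative placeholders, not measurements.
poc_days       = 1
poc_docs       = 550000000   # ~550M tweets/day from the firehose
poc_index_tb   = 1.5         # observed on-disk index size (hypothetical)
poc_shards     = 4           # shards used in the PoC (hypothetical)

retention_days = 2 * 365     # keep two years
factor         = retention_days / float(poc_days)

print("docs:   %d" % (poc_docs * factor))
print("index:  %.0f TB" % (poc_index_tb * factor))
print("shards: %.0f" % (poc_shards * factor))

# Caveat as above: extrapolating more than ~10x past what you actually
# measured is unreliable (merge cost, GC, and query fan-out all scale
# nonlinearly), so treat these as floor estimates, not a capacity plan.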
-- Jack Krupansky

On Tue, Mar 8, 2016 at 11:50 AM, Joseph Obernberger
<joseph.obernber...@gmail.com> wrote:

> Thank you for the links and explanation. We are using GATE (General
> Architecture for Text Engineering) and parts of the Stanford NER/Parser
> for the data that we ingest, but we do not apply it to the queries - only
> the data. We have been concentrating on the back-end and the analytics,
> not so much on what comes in as queries; that is something we need to
> address. For this hypothetical, I wanted to get ideas on what questions
> would need to be asked, and how large the system would need to be. Thank
> you all very much for the information so far!
>
> Jack - I want to be a guru-level Solr expert. :)
>
> -Joe
>
> On Sun, Mar 6, 2016 at 1:29 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > This is a very good presentation on using entity extraction in query
> > understanding. As you'll see from the preso, it is not easy.
> >
> > http://www.slideshare.net/dtunkelang/better-search-through-query-understanding
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/ (my blog)
> >
> > > On Mar 6, 2016, at 7:27 AM, Jack Krupansky <jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > Back to the original question... there are two answers:
> > >
> > > 1. Yes - for guru-level Solr experts.
> > > 2. No - for anybody else.
> > >
> > > For starters (as always), you would need to do a lot more upfront
> > > work on mapping out the forms of query which will be supported. For
> > > example: Is your focus on precision or recall? Are you looking to
> > > analyze all matching tweets or just a sample? What are the load,
> > > throughput, and latency requirements? Are there any spatial search
> > > requirements? Any entity search requirements? Without a clear view
> > > of the query requirements it simply isn't possible to even begin
> > > defining a data model. And without a data model, indexing is a
> > > fool's errand. In short: no focus, no progress.
> > >
> > > -- Jack Krupansky
> > >
> > > On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com>
> > > wrote:
> > >
> > >> Entity Recognition means you may want to recognize different
> > >> entities - name/person, email, location/city/state/country, etc. -
> > >> in your tweets/messages, with the goal of providing more relevant
> > >> results to users. NER can be used at query time or at indexing
> > >> (data enrichment) time.
> > >>
> > >> Thanks,
> > >> Susheel
> > >>
> > >> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger
> > >> <joseph.obernber...@gmail.com> wrote:
> > >>
> > >>> Thank you all very much for all the responses so far. I've enjoyed
> > >>> reading them! We have noticed that storing data inside of Solr
> > >>> results in significantly worse performance (particularly for
> > >>> faceting), so we store the values of all the fields elsewhere but
> > >>> index all the data with Solr Cloud. I think the suggestion about
> > >>> splitting the data up into blocks of date/time is where we would
> > >>> be headed: either two Solr Cloud clusters - one to handle ~30 days
> > >>> of data and one to handle historical data - or a single Solr Cloud
> > >>> cluster with multiple cores/collections. Either way you'd need a
> > >>> job to come through and clean up old data.
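> > >>>
> > >>> A rough sketch of what such a cleanup/rotation job could look like
> > >>> against the Collections API (the daily naming scheme, shard count,
> > >>> and 30-day window are made up for illustration):
> > >>>
> > >>> import requests
> > >>> from datetime import datetime, timedelta
> > >>>
> > >>> SOLR = "http://localhost:8983/solr/admin/collections"
> > >>> RETENTION_DAYS = 30  # hypothetical hot-tier window
> > >>>
> > >>> def rotate(today=None):
> > >>>     today = today or datetime.utcnow()
> > >>>     name = lambda d: "tweets_" + d.strftime("%Y_%m_%d")
> > >>>
> > >>>     # Create today's collection (a "tweets" config set is assumed
> > >>>     # to already be in ZooKeeper).
> > >>>     requests.get(SOLR, params={
> > >>>         "action": "CREATE", "name": name(today),
> > >>>         "numShards": 8, "collection.configName": "tweets"})
> > >>>
> > >>>     # Atomically repoint a "hot" alias at the newest 30 days.
> > >>>     days = [today - timedelta(days=i) for i in range(RETENTION_DAYS)]
> > >>>     requests.get(SOLR, params={
> > >>>         "action": "CREATEALIAS", "name": "tweets_hot",
> > >>>         "collections": ",".join(name(d) for d in days)})
> > >>>
> > >>>     # Drop (or hand off to the historical tier) the expired day.
> > >>>     requests.get(SOLR, params={
> > >>>         "action": "DELETE",
> > >>>         "name": name(today - timedelta(days=RETENTION_DAYS))})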
> > >>> The historical cluster would have much worse performance,
> > >>> particularly for clustering and faceting the data, but that may be
> > >>> acceptable.
> > >>>
> > >>> I don't know what you mean by 'entity recognition in the queries'
> > >>> - could you elaborate?
> > >>>
> > >>> We would want to index and potentially facet on any of the fields
> > >>> - for example entities_media_url, username, even background color
> > >>> - but we do not know a priori which fields will be important to
> > >>> users. As to why we would want to make the data searchable: well,
> > >>> I don't make the rules! Tweets are not the only data source, but
> > >>> they are certainly the largest we are currently looking at
> > >>> handling.
> > >>>
> > >>> I will read up on the Berlin Buzzwords talks - thank you for the
> > >>> info!
> > >>>
> > >>> -Joe
> > >>>
> > >>> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky
> > >>> <jack.krupan...@gmail.com> wrote:
> > >>>
> > >>>> As always, the initial question is how you intend to query the
> > >>>> data - query drives data modeling. How real-time do you need
> > >>>> queries to be? How fast do you need archive queries to be? How
> > >>>> many fields do you need to query on? How much entity recognition
> > >>>> do you need in queries?
> > >>>>
> > >>>> -- Jack Krupansky
> > >>>>
> > >>>> On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk>
> > >>>> wrote:
> > >>>>
> > >>>>> On 03/03/2016 19:25, Toke Eskildsen wrote:
> > >>>>>
> > >>>>>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi All - would it be reasonable to index the Twitter
> > >>>>>>> 'firehose' with Solr Cloud - roughly 500-600 million docs per
> > >>>>>>> day, indexing each of the fields (about 180)?
> > >>>>>>
> > >>>>>> Possible, yes. Reasonable? It is not going to be cheap.
> > >>>>>>
> > >>>>>> Twitter index the tweets themselves and have been quite open
> > >>>>>> about how they do it. I would suggest looking for their
> > >>>>>> presentations, slides, or recordings. They have presented at
> > >>>>>> Berlin Buzzwords and Lucene/Solr Revolution, and probably
> > >>>>>> elsewhere too. The gist is that they have done a lot of work
> > >>>>>> and custom coding to handle it.
> > >>>>>
> > >>>>> As I recall they're not using Solr, but rather an in-house layer
> > >>>>> built on a customised version of Lucene. They're indexing around
> > >>>>> half a trillion tweets.
> > >>>>>
> > >>>>> If the idea is to provide a searchable archive of all tweets, my
> > >>>>> first question would be 'why': if the idea is to monitor new
> > >>>>> tweets for particular patterns, there are better ways to do this
> > >>>>> (Luwak, for example).
> > >>>>>
> > >>>>> Charlie
> > >>>>>
> > >>>>>>> If I were to guess at a sharded setup to handle such data, and
> > >>>>>>> keep 2 years' worth, I would guess about 2500 shards. Is that
> > >>>>>>> reasonable?
> > >>>>>>
> > >>>>>> I think you need to think well beyond standard SolrCloud
> > >>>>>> setups. Even if you manage to get 2500 shards running, you will
> > >>>>>> want to do a lot of tweaking in the way you issue queries so
> > >>>>>> that each request does not have to search all 2500 shards.
> > >>>>>> Prioritizing newer material, and only querying the older shards
> > >>>>>> if there are not enough recent results, is one example.
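> > >>>>>>
> > >>>>>> A sketch of that kind of tiered routing (the alias names and
> > >>>>>> the result threshold are invented for illustration):
> > >>>>>>
> > >>>>>> import requests
> > >>>>>>
> > >>>>>> # Newest tier first; each entry is a collection or alias.
> > >>>>>> TIERS = ["tweets_hot", "tweets_2015", "tweets_2014"]
> > >>>>>> WANTED = 100  # stop fanning out once we have this many hits
> > >>>>>>
> > >>>>>> def tiered_search(q):
> > >>>>>>     hits = []
> > >>>>>>     for tier in TIERS:
> > >>>>>>         # Only touch an older tier when the newer ones did not
> > >>>>>>         # return enough results.
> > >>>>>>         resp = requests.get(
> > >>>>>>             "http://localhost:8983/solr/%s/select" % tier,
> > >>>>>>             params={"q": q, "rows": WANTED - len(hits),
> > >>>>>>                     "wt": "json"}).json()
> > >>>>>>         hits.extend(resp["response"]["docs"])
> > >>>>>>         if len(hits) >= WANTED:
> > >>>>>>             break
> > >>>>>>     return hits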
> > >>>>>>
> > >>>>>> I highly doubt that a single SolrCloud is the best answer here.
> > >>>>>> Maybe one cloud for each month and a lot of external logic?
> > >>>>>>
> > >>>>>> - Toke Eskildsen
> > >>>>>
> > >>>>> --
> > >>>>> Charlie Hull
> > >>>>> Flax - Open Source Enterprise Search
> > >>>>>
> > >>>>> tel/fax: +44 (0)8700 118334
> > >>>>> mobile: +44 (0)7767 825828
> > >>>>> web: www.flax.co.uk