Re: Indexing Twitter - Hypothetical

2016-03-08 Thread Jack Krupansky
You have my permission... and blessing... and... condolences! BTW, our usual recommendation is to do a subset proof of concept to see how all the pieces come together and then calculate the scaling from there. IOW, go ahead and index a day, a week, a month from the firehose and see how many nodes,

Re: Indexing Twitter - Hypothetical

2016-03-08 Thread Joseph Obernberger
Thank you for the links and explanation. We are using GATE (General Architecture for Text Engineering) and parts of the Stanford NER/Parser for the data that we ingest, but we do not apply it to the queries - only the data. We've been concentrating on the back-end, and analytics, not so much what

Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Walter Underwood
This is a very good presentation on using entity extraction in query understanding. As you’ll see from the preso, it is not easy. http://www.slideshare.net/dtunkelang/better-search-through-query-understanding wunde

Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Jack Krupansky
Back to the original question... there are two answers: 1. Yes - for guru-level Solr experts. 2. No - for anybody else. For starters, (as always), you would need to do a lot more upfront work on mapping out the forms of query which will be supported. For example, is your focus on precision or rec

Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Susheel Kumar
Entity recognition means recognizing different entities (name/person, email, location/city/state/country, etc.) in your tweets/messages, with the goal of providing more relevant results to users. NER can be used at query time or at indexing (data enrichment) time. Thanks, Susheel On Fri, Mar 4,

Re: Indexing Twitter - Hypothetical

2016-03-04 Thread Joseph Obernberger
Thank you all very much for all the responses so far. I've enjoyed reading them! We have noticed that storing data inside of Solr results in significantly worse performance (particularly for faceting), so we store the values of all the fields elsewhere, but index all the data with Solr Cloud. I thin

Re: Indexing Twitter - Hypothetical

2016-03-04 Thread Jack Krupansky
As always, the initial question is how you intend to query the data - query drives data modeling. How real-time do you need queries to be? How fast do you need archive queries to be? How many fields do you need to query on? How much entity recognition do you need in queries? -- Jack Krupansky On

Re: Indexing Twitter - Hypothetical

2016-03-04 Thread Charlie Hull
On 03/03/2016 19:25, Toke Eskildsen wrote:
> Joseph Obernberger wrote:
>> Hi All - would it be reasonable to index the Twitter 'firehose' with Solr Cloud - roughly 500-600 million docs per day indexing each of the fields (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap. Twitter

Re: Indexing Twitter - Hypothetical

2016-03-03 Thread Alexandre Rafalovitch
I think some of Twitter's need to index in a particular way comes from their real-time requirements. So, that's part of the decision for the original poster, on how responsive the data needs to be. As to the rest, I think the company that shows Twitter messages on TV does something similar with Solr. They

Re: Indexing Twitter - Hypothetical

2016-03-03 Thread Jack Krupansky
As always, the initial question needs to be how you wish to query the data - query will drive the data model. I don't want to put words in your mouth as to your query requirements, so... clue us in on your query requirements. -- Jack Krupansky On Thu, Mar 3, 2016 at 2:25 PM, Toke Eskildse

Re: Indexing Twitter - Hypothetical

2016-03-03 Thread Toke Eskildsen
Joseph Obernberger wrote:
> Hi All - would it be reasonable to index the Twitter 'firehose' with Solr Cloud - roughly 500-600 million docs per day indexing each of the fields (about 180)?

Possible, yes. Reasonable? It is not going to be cheap. Twitter index the tweets themselves and have bee

Indexing Twitter - Hypothetical

2016-03-03 Thread Joseph Obernberger
Hi All - would it be reasonable to index the Twitter 'firehose' with Solr Cloud - roughly 500-600 million docs per day, indexing each of the fields (about 180)? If I were to guess at a sharded setup to handle such data, and keep 2 years' worth, I would guess about 2500 shards. Is that reasonable? Is
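
For reference, the figures in the original question imply a rough per-shard document count. The sketch below only does that arithmetic; it assumes 365-day years and an even spread of documents across shards, and the function name is made up for illustration. It says nothing about whether a shard of that size would perform acceptably with about 180 indexed fields.

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap days


def docs_per_shard(docs_per_day, years, shards):
    """Average documents per shard, assuming an even distribution."""
    total_docs = docs_per_day * DAYS_PER_YEAR * years
    return total_docs / shards


# Figures from the question: 500-600 million docs/day, 2 years, 2500 shards.
low = docs_per_shard(500_000_000, 2, 2500)
high = docs_per_shard(600_000_000, 2, 2500)
print(f"{low:,.0f} to {high:,.0f} docs per shard")
```

With these inputs the total is 365 to 438 billion documents, or roughly 146 to 175 million documents per shard.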