This is a very good presentation on using entity extraction in query understanding. As you’ll see from the preso, it is not easy.
http://www.slideshare.net/dtunkelang/better-search-through-query-understanding

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 6, 2016, at 7:27 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> Back to the original question... there are two answers:
>
> 1. Yes - for guru-level Solr experts.
> 2. No - for anybody else.
>
> For starters (as always), you would need to do a lot more upfront work on
> mapping out the forms of query that will be supported. For example, is
> your focus on precision or recall? Are you looking to analyze all
> matching tweets or just a sample? What are the load, throughput, and
> latency requirements? Any spatial search requirements? Any entity search
> requirements? Without a clear view of the query requirements it simply
> isn't possible to even begin defining a data model. And without a data
> model, indexing is a fool's errand. In short: no focus, no progress.
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>
>> Entity recognition means you may want to recognize different entities -
>> names/persons, emails, locations (city/state/country), etc. - in your
>> tweets/messages, with the goal of providing more relevant results to
>> users. NER can be used at query time or at indexing (data enrichment)
>> time.
>>
>> Thanks,
>> Susheel
>>
>> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> Thank you all very much for all the responses so far. I've enjoyed
>>> reading them! We have noticed that storing data inside of Solr results
>>> in significantly worse performance (particularly for faceting), so we
>>> store the values of all the fields elsewhere but index all the data
>>> with Solr Cloud. I think the suggestion about splitting the data up
>>> into blocks of date/time is where we would be headed.
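Susheel's point about index-time enrichment can be sketched in a few lines. This is a hypothetical illustration, not anything from the thread: it pulls simple "entities" (hashtags, mentions, URLs) out of tweet text into separate facetable fields before the document is sent to Solr. The field names are made up, and a real pipeline would use a proper NER model (e.g. OpenNLP) for persons and locations.

```python
import re

# Hypothetical index-time enrichment: extract simple entities from the
# tweet text into separate fields so they can be faceted on later.
# Field names (entities_hashtag, entities_mention, entities_url) are
# invented for illustration.
HASHTAG = re.compile(r'#(\w+)')
MENTION = re.compile(r'@(\w+)')
URL = re.compile(r'https?://\S+')

def enrich(tweet: dict) -> dict:
    """Return a copy of the tweet doc with extracted entity fields added."""
    text = tweet.get('text', '')
    doc = dict(tweet)
    doc['entities_hashtag'] = HASHTAG.findall(text)
    doc['entities_mention'] = MENTION.findall(text)
    doc['entities_url'] = URL.findall(text)
    return doc

# The enriched doc would then be posted to Solr's /update handler.
```

The same extraction could instead run at query time to recognize entities in the user's query, as Susheel notes.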
>>> Having two Solr Cloud clusters - one to handle ~30 days of data, and
>>> one to handle historical data - is one option. Another option is to use
>>> a single Solr Cloud cluster but with multiple cores/collections. Either
>>> way you'd need a job to come through and clean up old data. The
>>> historical cluster would have much worse performance, particularly for
>>> clustering and faceting the data, but that may be acceptable.
>>> I don't know what you mean by 'entity recognition in the queries' -
>>> could you elaborate?
>>>
>>> We would want to index and potentially facet on any of the fields - for
>>> example entities_media_url, username, even background color - but we do
>>> not know a priori which fields will be important to users.
>>> As to why we would want to make the data searchable: well, I don't make
>>> the rules! Tweets are not the only data source, but they are certainly
>>> the largest we are currently looking at handling.
>>>
>>> I will read up on the Berlin Buzzwords talks - thank you for the info!
>>>
>>> -Joe
>>>
>>> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky <jack.krupan...@gmail.com>
>>> wrote:
>>>
>>>> As always, the initial question is how you intend to query the data -
>>>> query drives data modeling. How real-time do you need queries to be?
>>>> How fast do you need archive queries to be? How many fields do you
>>>> need to query on? How much entity recognition do you need in queries?
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk> wrote:
>>>>
>>>>> On 03/03/2016 19:25, Toke Eskildsen wrote:
>>>>>
>>>>>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All - would it be reasonable to index the Twitter 'firehose'
>>>>>>> with Solr Cloud - roughly 500-600 million docs per day, indexing
>>>>>>> each of the fields (about 180)?
>>>>>>
>>>>>> Possible, yes. Reasonable? It is not going to be cheap.
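The "multiple cores/collections plus a cleanup job" idea above can be sketched as date-based routing: each tweet goes to a monthly collection, and a periodic job drops collections that fall outside the retention window. The collection naming scheme and the 24-month retention are assumptions for illustration; deletion itself would go through Solr's Collections API.

```python
from datetime import datetime, timezone

def collection_for(ts: datetime) -> str:
    # Route each tweet to a monthly collection, e.g. "tweets_2016_03".
    # The naming scheme is an assumption, not from the thread.
    return f"tweets_{ts.year}_{ts.month:02d}"

def collections_to_drop(existing: list[str], now: datetime,
                        keep_months: int = 24) -> list[str]:
    """Return collections older than the retention window (assumed 24 months).

    A cleanup job would pass each returned name to the Collections API
    DELETE action.
    """
    cutoff = (now.year * 12 + now.month - 1) - keep_months
    def age_in_months(name: str) -> int:
        _, year, month = name.split('_')
        return int(year) * 12 + int(month) - 1
    return [c for c in existing if age_in_months(c) < cutoff]
```

Dropping a whole monthly collection is far cheaper than delete-by-query over a shared index, which is one reason time-sliced collections suit this kind of retention policy.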
>>>>>>
>>>>>> Twitter index the tweets themselves and have been quite open about
>>>>>> how they do it. I would suggest looking for their presentations,
>>>>>> slides, or recordings. They have presented at Berlin Buzzwords and
>>>>>> Lucene/Solr Revolution, and probably elsewhere too. The gist is that
>>>>>> they have done a lot of work and custom coding to handle it.
>>>>>
>>>>> As I recall they're not using Solr, but rather an in-house layer
>>>>> built on a customised version of Lucene. They're indexing around half
>>>>> a trillion tweets.
>>>>>
>>>>> If the idea is to provide a searchable archive of all tweets, my
>>>>> first question would be 'why': if the idea is to monitor new tweets
>>>>> for particular patterns, there are better ways to do this (Luwak, for
>>>>> example).
>>>>>
>>>>> Charlie
>>>>>
>>>>>>> If I were to guess at a sharded setup to handle such data, and keep
>>>>>>> 2 years' worth, I would guess about 2500 shards. Is that
>>>>>>> reasonable?
>>>>>>
>>>>>> I think you need to think well beyond standard SolrCloud setups.
>>>>>> Even if you manage to get 2500 shards running, you will want to do a
>>>>>> lot of tweaking in the way you issue queries so that each request
>>>>>> does not require all 2500 shards to be searched. Prioritizing newer
>>>>>> material and only querying the older shards if there are not enough
>>>>>> recent results is an example.
>>>>>>
>>>>>> I highly doubt that a single SolrCloud is the best answer here.
>>>>>> Maybe one cloud for each month and a lot of external logic?
>>>>>>
>>>>>> - Toke Eskildsen
>>>>>
>>>>> --
>>>>> Charlie Hull
>>>>> Flax - Open Source Enterprise Search
>>>>>
>>>>> tel/fax: +44 (0)8700 118334
>>>>> mobile: +44 (0)7767 825828
>>>>> web: www.flax.co.uk
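Toke's suggestion - prioritize newer material and only touch older shards when there are not enough recent results - can be sketched as tiered querying. Everything here is hypothetical: the tier layout, the `search` callback (which might, in practice, run the query against a group of collections via SolrCloud's `collection=` request parameter), and the result threshold.

```python
from typing import Callable

def tiered_search(query: str,
                  tiers: list[list[str]],
                  search: Callable[[str, list[str]], list[dict]],
                  wanted: int = 10) -> list[dict]:
    """Query tiers newest-first; stop once enough hits are collected.

    `tiers` is a list of collection groups ordered newest to oldest, and
    `search` runs the query against one group - both are assumptions for
    this sketch, not an actual Solr API.
    """
    results: list[dict] = []
    for collections in tiers:
        results.extend(search(query, collections))
        if len(results) >= wanted:
            break  # newer tiers satisfied the request; skip older shards
    return results[:wanted]
```

With a layout like one tier for the last ~30 days and monthly tiers behind it, most queries would never fan out to the full 2500-shard historical set.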