This is a very good presentation on using entity extraction in query understanding. As you’ll see from the preso, it is not easy.
http://www.slideshare.net/dtunkelang/better-search-through-query-understanding

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 6, 2016, at 7:27 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> Back to the original question... there are two answers:
>
> 1. Yes - for guru-level Solr experts.
> 2. No - for anybody else.
>
> For starters (as always), you would need to do a lot more upfront work on
> mapping out the forms of query that will be supported. For example, is
> your focus on precision or recall? Are you looking to analyze all
> matching tweets or just a sample? What are the load, throughput, and
> latency requirements? Any spatial search requirements? Any entity search
> requirements? Without a clear view of the query requirements it simply
> isn't possible to even begin defining a data model. And without a data
> model, indexing is a fool's errand. In short: no focus, no progress.
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>
>> Entity recognition means you may want to recognize different entities -
>> names/persons, emails, locations (city/state/country), etc. - in your
>> tweets/messages, with the goal of providing more relevant results to
>> users. NER can be used at query time or at indexing (data enrichment)
>> time.
>>
>> Thanks,
>> Susheel
>>
>> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> Thank you all very much for all the responses so far. I've enjoyed
>>> reading them! We have noticed that storing data inside of Solr results
>>> in significantly worse performance (particularly for faceting), so we
>>> store the values of all the fields elsewhere but index all the data
>>> with Solr Cloud. I think the suggestion about splitting the data up
>>> into blocks of date/time is where we would be headed.
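Susheel's point about index-time enrichment can be sketched in a few lines. This is a hypothetical illustration, not anything from the thread: it pulls simple "entities" (hashtags, mentions, URLs) out of tweet text into separate facetable fields before the document is sent to Solr. The field names are made up, and a real pipeline would use a proper NER model (e.g. OpenNLP) for persons and locations.

```python
import re

# Hypothetical index-time enrichment: extract simple entities from the
# tweet text into separate fields so they can be faceted on later.
# Field names (entities_hashtag, entities_mention, entities_url) are
# invented for illustration.
HASHTAG = re.compile(r'#(\w+)')
MENTION = re.compile(r'@(\w+)')
URL = re.compile(r'https?://\S+')

def enrich(tweet: dict) -> dict:
    """Return a copy of the tweet doc with extracted entity fields added."""
    text = tweet.get('text', '')
    doc = dict(tweet)
    doc['entities_hashtag'] = HASHTAG.findall(text)
    doc['entities_mention'] = MENTION.findall(text)
    doc['entities_url'] = URL.findall(text)
    return doc

# The enriched doc would then be posted to Solr's /update handler.
```

The same extraction could instead run at query time to recognize entities in the user's query, as Susheel notes.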
>>> Having two Solr Cloud clusters - one to handle ~30 days of data, and
>>> one to handle historical data - is one option. Another option is to use
>>> a single Solr Cloud cluster but with multiple cores/collections. Either
>>> way you'd need a job to come through and clean up old data. The
>>> historical cluster would have much worse performance, particularly for
>>> clustering and faceting the data, but that may be acceptable.
>>> I don't know what you mean by 'entity recognition in the queries' -
>>> could you elaborate?
>>>
>>> We would want to index and potentially facet on any of the fields - for
>>> example entities_media_url, username, even background color - but we do
>>> not know a priori which fields will be important to users.
>>> As to why we would want to make the data searchable: well, I don't make
>>> the rules! Tweets are not the only data source, but they are certainly
>>> the largest we are currently looking at handling.
>>>
>>> I will read up on the Berlin Buzzwords talks - thank you for the info!
>>>
>>> -Joe
>>>
>>> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky <jack.krupan...@gmail.com>
>>> wrote:
>>>
>>>> As always, the initial question is how you intend to query the data -
>>>> query drives data modeling. How real-time do you need queries to be?
>>>> How fast do you need archive queries to be? How many fields do you
>>>> need to query on? How much entity recognition do you need in queries?
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk> wrote:
>>>>
>>>>> On 03/03/2016 19:25, Toke Eskildsen wrote:
>>>>>
>>>>>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All - would it be reasonable to index the Twitter 'firehose'
>>>>>>> with Solr Cloud - roughly 500-600 million docs per day, indexing
>>>>>>> each of the fields (about 180)?
>>>>>>
>>>>>> Possible, yes. Reasonable? It is not going to be cheap.
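The "multiple cores/collections plus a cleanup job" idea above can be sketched as date-based routing: each tweet goes to a monthly collection, and a periodic job drops collections that fall outside the retention window. The collection naming scheme and the 24-month retention are assumptions for illustration; deletion itself would go through Solr's Collections API.

```python
from datetime import datetime, timezone

def collection_for(ts: datetime) -> str:
    # Route each tweet to a monthly collection, e.g. "tweets_2016_03".
    # The naming scheme is an assumption, not from the thread.
    return f"tweets_{ts.year}_{ts.month:02d}"

def collections_to_drop(existing: list[str], now: datetime,
                        keep_months: int = 24) -> list[str]:
    """Return collections older than the retention window (assumed 24 months).

    A cleanup job would pass each returned name to the Collections API
    DELETE action.
    """
    cutoff = (now.year * 12 + now.month - 1) - keep_months
    def age_in_months(name: str) -> int:
        _, year, month = name.split('_')
        return int(year) * 12 + int(month) - 1
    return [c for c in existing if age_in_months(c) < cutoff]
```

Dropping a whole monthly collection is far cheaper than delete-by-query over a shared index, which is one reason time-sliced collections suit this kind of retention policy.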
>>>>>>
>>>>>> Twitter index the tweets themselves and have been quite open about
>>>>>> how they do it. I would suggest looking for their presentations,
>>>>>> slides, or recordings. They have presented at Berlin Buzzwords and
>>>>>> Lucene/Solr Revolution, and probably elsewhere too. The gist is that
>>>>>> they have done a lot of work and custom coding to handle it.
>>>>>
>>>>> As I recall they're not using Solr, but rather an in-house layer
>>>>> built on a customised version of Lucene. They're indexing around half
>>>>> a trillion tweets.
>>>>>
>>>>> If the idea is to provide a searchable archive of all tweets, my
>>>>> first question would be 'why': if the idea is to monitor new tweets
>>>>> for particular patterns, there are better ways to do this (Luwak, for
>>>>> example).
>>>>>
>>>>> Charlie
>>>>>
>>>>>>> If I were to guess at a sharded setup to handle such data, and keep
>>>>>>> 2 years' worth, I would guess about 2500 shards. Is that
>>>>>>> reasonable?
>>>>>>
>>>>>> I think you need to think well beyond standard SolrCloud setups.
>>>>>> Even if you manage to get 2500 shards running, you will want to do a
>>>>>> lot of tweaking in the way you issue queries so that each request
>>>>>> does not require all 2500 shards to be searched. Prioritizing newer
>>>>>> material and only querying the older shards if there are not enough
>>>>>> recent results is an example.
>>>>>>
>>>>>> I highly doubt that a single SolrCloud is the best answer here.
>>>>>> Maybe one cloud for each month and a lot of external logic?
>>>>>>
>>>>>> - Toke Eskildsen
>>>>>
>>>>> --
>>>>> Charlie Hull
>>>>> Flax - Open Source Enterprise Search
>>>>>
>>>>> tel/fax: +44 (0)8700 118334
>>>>> mobile: +44 (0)7767 825828
>>>>> web: www.flax.co.uk
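Toke's suggestion - prioritize newer material and only touch older shards when there are not enough recent results - can be sketched as tiered querying. Everything here is hypothetical: the tier layout, the `search` callback (which might, in practice, run the query against a group of collections via SolrCloud's `collection=` request parameter), and the result threshold.

```python
from typing import Callable

def tiered_search(query: str,
                  tiers: list[list[str]],
                  search: Callable[[str, list[str]], list[dict]],
                  wanted: int = 10) -> list[dict]:
    """Query tiers newest-first; stop once enough hits are collected.

    `tiers` is a list of collection groups ordered newest to oldest, and
    `search` runs the query against one group - both are assumptions for
    this sketch, not an actual Solr API.
    """
    results: list[dict] = []
    for collections in tiers:
        results.extend(search(query, collections))
        if len(results) >= wanted:
            break  # newer tiers satisfied the request; skip older shards
    return results[:wanted]
```

With a layout like one tier for the last ~30 days and monthly tiers behind it, most queries would never fan out to the full 2500-shard historical set.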