Re: indexing unstructured text (tweets)

Giovanni Gherdovich Mon, 28 May 2012 07:09:59 -0700

Hello Jack, hi all,

2012/5/28 Jack Krupansky <j...@basetechnology.com>:
> Other obvious metadata from the Twitter API to index would be hashtags, user
> mentions (both the user id/screen name and user name), date/time, urls
> mentioned (expanded if a URL shortener is used), and possibly coordinates
> for spatial search.


You raise good points here.

Just to understand better how this works in Solr:
say that we have a tweet that uses a hashtag and
mentions another user. I don't know how this would actually
appear coming from the Twitter Streaming API,
so I am assuming that the tweet itself
(excluding date/time and so on) is just raw text,
like

"Hey @alex1987, thank you for telling me how cool is #rubyonrails"

So: in order to make Solr understand that here we have
a mention of a user (@alex1987) and a hashtag (#rubyonrails),
I have to extract that information myself, put it into
fields of my own schema, and preprocess the tweet in
order to get to

<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet_text">Hey @alex1987, thank you for telling me how cool is #rubyonrails</field>
<field name="user">happyRubyist</field>
<field name="mentions">alex1987</field>
<field name="hashtags">rubyonrails</field>
</doc>
</add>

Correct?
I have to preprocess the tweet and make those fields explicit
if I want them to be indexed as metadata, right?
I am asking because I am new to Solr.
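
To make the question concrete, here is a minimal sketch in Python of the
preprocessing step I have in mind. It assumes the tweet really does arrive
as raw text; the function name tweet_to_solr_xml and the two regular
expressions are made up for illustration and are certainly simpler than
Twitter's real rules for screen names and hashtags. The field names are
the ones from the XML example above.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns (an assumption, not Twitter's actual rules).
# The lookbehind keeps "foo@bar.com" from being read as a mention of "bar".
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def tweet_to_solr_xml(tweet_id, user, text):
    """Build a Solr <add><doc> message with mentions and hashtags as extra fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        # ElementTree escapes characters such as & and < in the text for us.
        ET.SubElement(doc, "field", name=name).text = value

    field("tweetId", tweet_id)
    field("tweet_text", text)
    field("user", user)
    # One <field> element per extracted value.
    for mention in MENTION_RE.findall(text):
        field("mentions", mention)
    for hashtag in HASHTAG_RE.findall(text):
        field("hashtags", hashtag)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(tweet_to_solr_xml(
        "ABCD1234",
        "happyRubyist",
        "Hey @alex1987, thank you for telling me how cool is #rubyonrails",
    ))
```

Since a tweet can carry several mentions or hashtags, I suppose "mentions"
and "hashtags" would have to be declared with multiValued="true" in the
schema. And if the API turns out to deliver hashtags and mentions already
parsed, the regular expressions can be dropped and only the XML-building
part is needed.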

> Although, I imagine quite a few people have already done this quite a few
> times before, so maybe somebody could contribute their Twitter Solr schema.
> Anybody?

Oh that would be nice :-)

Cheers,
Giovanni
