Re: indexing unstructured text (tweets)

Jack Krupansky Mon, 28 May 2012 07:24:10 -0700

The Twitter API extracts hashtags and user mentions for you, in addition to giving you the full raw text. You'll have to read up on the Twitter API.
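As a rough illustration only, here is a minimal Python sketch of mapping an already-parsed tweet to Solr fields. It assumes the tweet JSON carries an "entities" object with "hashtags" and "user_mentions" lists, which is how the API exposes them as far as I know. The function name tweet_to_solr_doc is made up for this example, and the Solr field names are simply the ones from the sample document quoted below.

```python
def tweet_to_solr_doc(tweet):
    """Map a parsed tweet (a dict decoded from the API's JSON) to Solr fields.

    Assumption: hashtags and mentions arrive pre-extracted under
    tweet["entities"]; check the Twitter API docs for the exact layout.
    """
    entities = tweet.get("entities", {})
    return {
        "tweetId": tweet["id_str"],
        "tweet_text": tweet["text"],
        "user": tweet["user"]["screen_name"],
        # One value per mention/hashtag, so these need multi-valued fields.
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
    }
```

With this approach no text parsing is needed on your side; the raw text is indexed as-is and the entity lists feed the metadata fields.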


-- Jack Krupansky

-----Original Message-----
From: Giovanni Gherdovich
Sent: Monday, May 28, 2012 10:09 AM
To: [email protected]
Subject: Re: indexing unstructured text (tweets)

Hello Jack, hi all,

2012/5/28 Jack Krupansky <[email protected]>:

Other obvious metadata from the Twitter API to index would be hashtags, user
mentions (both the user id/screen name and user name), date/time, urls
mentioned (expanded if a URL shortener is used), and possibly coordinates
for spatial search.


You raise good points here.

Just to understand better how it works in Solr:
say that we have a tweet that makes use of a hashtag and
mentions another user. I don't know how this would actually
appear coming from the Twitter Streaming API,
and I am assuming that the tweet itself, at least
(excluding date/time and so on), is just raw text,
like

"Hey @alex1987, thank you for telling me how cool is #rubyonrails"

So: in order to make Solr understand that here we have
a mention of a user (@alex1987) and a hashtag (#rubyonrails),
I have to extract that information myself, put it in
dedicated fields of my own schema, and preprocess the tweet
in order to get to

<add>
  <doc>
    <field name="tweetId">ABCD1234</field>
    <field name="tweet_text">Hey @alex1987, thank you for telling me how cool is #rubyonrails</field>
    <field name="user">happyRubyist</field>
    <field name="mentions">alex1987</field>
    <field name="hashtags">rubyonrails</field>
  </doc>
</add>

Correct?
I have to preprocess the tweet and make those fields explicit
if I want them to be indexed as metadata, right?
I am asking since I am new to Solr.
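To make the question concrete, here is a minimal Python sketch of the preprocessing I have in mind, assuming only the raw text is available. The regular expressions are a simplification (for instance, they would also match the part after "@" in an e-mail address), and the function name tweet_to_add_xml is made up for this example. The field names are the ones from the XML above; "mentions" and "hashtags" would presumably have to be declared multi-valued in the schema, since a tweet can contain several of each.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns: "@" or "#" followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def tweet_to_add_xml(tweet_id, user, text):
    """Build a Solr <add> update message for one tweet from its raw text."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        el = ET.SubElement(doc, "field", name=name)
        el.text = value

    field("tweetId", tweet_id)
    field("tweet_text", text)
    field("user", user)
    # Emit one <field> element per extracted mention and hashtag.
    for mention in MENTION_RE.findall(text):
        field("mentions", mention)
    for hashtag in HASHTAG_RE.findall(text):
        field("hashtags", hashtag)
    return ET.tostring(add, encoding="unicode")


xml_message = tweet_to_add_xml(
    "ABCD1234",
    "happyRubyist",
    "Hey @alex1987, thank you for telling me how cool is #rubyonrails",
)
```

The resulting string could then be posted to Solr's update handler; ElementTree takes care of escaping characters such as "&" and "<" in the tweet text.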

Although, I imagine quite a few people have already done this quite a few
times before, so maybe somebody could contribute their Twitter Solr schema.
Anybody?


Oh that would be nice :-)

Cheers,

Giovanni
