The Twitter API extracts hashtags and user mentions for you, in addition to
giving you the full raw text. You'll have to read up on the Twitter API.
-- Jack Krupansky
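
To illustrate the point above, here is a minimal sketch of reading those pre-extracted values from a tweet's JSON. It assumes the payload carries an "entities" object with "hashtags", "user_mentions" and "urls" lists, which is how I understand the API; check the field names against the Twitter API documentation. The sample payload and the extract_metadata helper are made up for illustration.

```python
import json

# Hypothetical, trimmed-down tweet payload modelled on the example in this
# thread; a real payload has many more fields.
SAMPLE_TWEET_JSON = """
{
  "id_str": "ABCD1234",
  "text": "Hey @alex1987, thank you for telling me how cool is #rubyonrails",
  "user": {"screen_name": "happyRubyist"},
  "entities": {
    "hashtags": [{"text": "rubyonrails"}],
    "user_mentions": [{"screen_name": "alex1987"}],
    "urls": []
  }
}
"""


def extract_metadata(tweet):
    """Collect hashtags, mentions and URLs from the assumed 'entities' object."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        # Prefer the expanded URL when present, else fall back to the short one.
        "urls": [u.get("expanded_url") or u.get("url") for u in entities.get("urls", [])],
    }


if __name__ == "__main__":
    print(extract_metadata(json.loads(SAMPLE_TWEET_JSON)))
```
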
-----Original Message-----
From: Giovanni Gherdovich
Sent: Monday, May 28, 2012 10:09 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)
Hello Jack, hi all,
2012/5/28 Jack Krupansky <j...@basetechnology.com>:
> Other obvious metadata from the Twitter API to index would be hashtags,
> user mentions (both the user id/screen name and user name), date/time,
> URLs mentioned (expanded if a URL shortener is used), and possibly
> coordinates for spatial search.
You raise good points here.
Just to understand better how it works in Solr:
say that we have a tweet that uses a hashtag and
mentions another user. I don't know how this would actually
appear coming from the Twitter Streaming API,
and I am assuming that the tweet itself
(excluding date/time and so on) is just raw text,
like
"Hey @alex1987, thank you for telling me how cool is #rubyonrails"
So: in order to make Solr understand that here we have
a mention of a user (@alex1987) and a hashtag (#rubyonrails),
I have to extract that information myself, include it
in my own XML schema, and preprocess the tweet in
order to get to
<add>
  <doc>
    <field name="tweetId">ABCD1234</field>
    <field name="tweet_text">Hey @alex1987, thank you for telling me how cool is #rubyonrails</field>
    <field name="user">happyRubyist</field>
    <field name="mentions">alex1987</field>
    <field name="hashtags">rubyonrails</field>
  </doc>
</add>
Correct?
I have to preprocess the tweet and make those fields explicit if I want
them to be indexed as metadata, right?
I am asking since I am new to Solr.
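
A minimal sketch of the preprocessing step described above, in case it helps make the question concrete. It assumes only the raw text is available and uses two simple regular expressions as a rough approximation of Twitter's own mention and hashtag rules. The function name tweet_to_solr_add is hypothetical; the field names are the ones from the XML above, and "mentions" and "hashtags" would need to be declared multiValued in the schema, since each value goes in its own field element.

```python
import re
import xml.etree.ElementTree as ET

# Rough approximations only; Twitter's real tokenization rules are stricter.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def tweet_to_solr_add(tweet_id, user, text):
    """Build a Solr <add> message for one tweet (hypothetical helper)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        # One <field> element per value; repeated names model multi-valued fields.
        el = ET.SubElement(doc, "field", {"name": name})
        el.text = value

    field("tweetId", tweet_id)
    field("tweet_text", text)
    field("user", user)
    for mention in MENTION_RE.findall(text):
        field("mentions", mention)
    for hashtag in HASHTAG_RE.findall(text):
        field("hashtags", hashtag)

    # ElementTree escapes &, < and > in the tweet text for us.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(tweet_to_solr_add(
        "ABCD1234",
        "happyRubyist",
        "Hey @alex1987, thank you for telling me how cool is #rubyonrails",
    ))
```

The resulting string could then be posted to Solr's update handler like any other XML add message.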
> Although, I imagine quite a few people have already done this quite a few
> times before, so maybe somebody could contribute their Twitter Solr
> schema.
> Anybody?
Oh that would be nice :-)
Cheers,
Giovanni