Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user
names and hash tags:
http://saturnboy.com/2010/02/parsing-twitter-with-regexp/
-- Jack Krupansky
-----Original Message-----
From: Giovanni Gherdovich
Sent: Monday, May 28, 2012 10:35 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)
Hello Jack and Anuj,
2012/5/28 Jack Krupansky <j...@basetechnology.com>:
> The Twitter API extracts hashtags and user mentions for you, in addition to
> giving you the full raw text. You'll have to read up on the Twitter API.
That's what I thought just after hitting "send" on the message above ;-)
I am pretty sure the Twitter API format maps very nicely to a suitable
input format for Solr, and it may even be good enough to feed into
Solr directly.
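
To illustrate that mapping, here is a minimal Python sketch that flattens one
tweet, as returned by the Twitter API with its "entities" block, into a flat
document of the kind Solr's JSON update handler accepts. The function name
and the Solr field names (id, text, user, hashtags, mentions, urls) are
assumptions made up for this example and would have to match your own
schema.xml. The created_at value is passed through as a plain string; a Solr
date field would need it converted first.

```python
def tweet_to_solr_doc(tweet):
    """Flatten a Twitter API tweet (a dict parsed from JSON) into a flat document.

    The output field names are hypothetical; adapt them to your Solr schema.
    """
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id_str"],
        "text": tweet["text"],
        # Twitter's date format is kept as-is here (not a Solr date yet).
        "created_at": tweet.get("created_at"),
        "user": tweet.get("user", {}).get("screen_name"),
        # Multi-valued fields taken from the pre-extracted entities.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
    }
```

A list of such documents, serialized with json.dumps, could then be posted
to Solr's update handler.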
I am a bit unlucky here because I have been provided with
only the raw text for about 1.5 million tweets; so I would have
to write a few lines of code to restore at least user mentions,
hashtags and URLs.
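
Those "few lines of code" could look roughly like the Python sketch below.
The three patterns are simplified assumptions of mine, not Twitter's
official extraction rules: mentions are taken to be "@" plus up to 15 ASCII
word characters, hashtags "#" plus a word containing at least one non-digit,
and URLs anything starting with http:// or https:// up to the next
whitespace (so trailing punctuation would be included).

```python
import re

# Simplified patterns (assumptions for illustration, not Twitter's own rules).
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_@])@([A-Za-z0-9_]{1,15})")
HASHTAG_RE = re.compile(r"(?<![\w&])#(\w*[^\W\d]\w*)")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(text):
    """Recover user mentions, hashtags and URLs from the raw text of a tweet."""
    urls = URL_RE.findall(text)
    # Blank out URLs first, so a fragment such as "#frag" inside a link
    # is not mistaken for a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "mentions": MENTION_RE.findall(stripped),
        "hashtags": HASHTAG_RE.findall(stripped),
        "urls": urls,
    }


if __name__ == "__main__":
    print(extract_entities("RT @solr_user: indexing #tweets with #Solr http://example.com/a#frag"))
```

The lookbehind in front of "@" keeps e-mail addresses such as foo@bar.com
from being reported as mentions.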
2012/5/28 Anuj Kumar <anujs...@gmail.com>:
> This is a bit old but provides good information for schema design:
> http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php
> Found this link as well: https://gist.github.com/702360
> The types of the fields may depend on the search requirements.
Anuj, thanks for these very interesting links,
even though those kinds of specifics may already be covered
in the Twitter API docs.
Once I am done with my first Solr setup, I might
set up the whole pipeline (fetching the Twitter feeds myself)
on my own machines, so that I can exploit all of the
information provided by Twitter.
Cheers,
Giovanni