Re: indexing unstructured text (tweets)

Jack Krupansky Mon, 28 May 2012 07:42:57 -0700

Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get usernames and hash tags:


http://saturnboy.com/2010/02/parsing-twitter-with-regexp/


-- Jack Krupansky

-----Original Message-----
From: Giovanni Gherdovich
Sent: Monday, May 28, 2012 10:35 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)

Hello Jack and Anuj,

2012/5/28 Jack Krupansky <j...@basetechnology.com>:

The Twitter API extracts hashtags and user mentions for you, in addition to
giving you the full raw text. You'll have to read up on the Twitter API.


That's what I thought just after hitting "send" on the message above ;-)
I am pretty sure the Twitter API format maps very nicely to a suitable
input format for Solr, if not even being already good for direct
feeding into Solr.
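
To illustrate that mapping, here is a minimal Python sketch that flattens one
tweet object into a Solr document. It assumes the tweet carries the "entities"
structure of the Twitter REST API ("hashtags", "user_mentions", "urls"). The
Solr field names (id, text, hashtags, mentions, urls) are hypothetical and
would have to match whatever the schema actually defines.

```python
import json


def tweet_to_solr_doc(tweet):
    """Flatten a Twitter API tweet dict into a flat dict for Solr.

    The output field names are assumptions made for this sketch; they are
    not prescribed by Solr or by Twitter.
    """
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id_str"],
        "text": tweet["text"],
        # Multi-valued fields: one value per extracted entity.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        # Prefer the expanded URL when the API provides one.
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
    }


def to_update_payload(tweets):
    """Serialize a batch of tweets as a JSON array of documents."""
    return json.dumps([tweet_to_solr_doc(t) for t in tweets])
```

The resulting JSON array is the kind of payload one would post to Solr's JSON
update handler; the hashtags, mentions and urls fields would need to be
declared multi-valued in the schema.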

I am a bit unlucky here because I have been provided with
only the raw text for about 1.5 million tweets; so I would have
to write a few lines of code to restore at least user mentions,
hashtags and URLs.
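
Those few lines could look roughly like the following Python sketch, which
recovers mentions, hashtags and URLs from raw tweet text with regular
expressions. The patterns are simplified assumptions, not Twitter's own
tokenization rules: for instance, a URL match keeps any trailing punctuation,
and the hashtag pattern only covers word characters.

```python
import re

# Simplified patterns (assumptions for this sketch, not Twitter's exact rules).
# The lookbehinds skip "@" and "#" that sit inside a word, e.g. email addresses.
MENTION_RE = re.compile(r"(?<![\w@])@(\w{1,15})")
HASHTAG_RE = re.compile(r"(?<![\w#])#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return the mentions, hashtags and URLs found in a raw tweet."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so that "#fragment" parts are not read as hashtags.
    stripped = URL_RE.sub(" ", text)
    return {
        "mentions": MENTION_RE.findall(stripped),
        "hashtags": HASHTAG_RE.findall(stripped),
        "urls": urls,
    }
```

Each of the three lists could then be indexed as its own multi-valued field
next to the raw text.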


2012/5/28 Anuj Kumar <anujs...@gmail.com>:

This is a bit old but provides good information for schema design:
http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php

Found this link as well: https://gist.github.com/702360

The field types may depend on the search requirements.


Anuj, you provide very interesting links here, thanks,
even though those kinds of specifics might already be present
in the Twitter API docs.
Once I am done with my first Solr setup, I might
set up the whole pipeline (getting the Twitter feeds myself)
on my machines, so that I can exploit the whole
information content provided by Twitter.

Cheers,

Giovanni
