Re: indexing unstructured text (tweets)

David Radunz Mon, 28 May 2012 05:00:47 -0700

Hey,

I think you might be over-thinking this. Tweets are structured. You have the content (tweet), the user who tweeted it and various other metadata. So your 'document' might look like this:


<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet">I bought some apples</field>
<field name="user">JohnnyBoy</field>
</doc>
</add>

To get this structure, you can use any programming language you're comfortable with and load it into Solr via various means. Obviously you can add more 'meta' fields that you get from Twitter if you want as well.
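
As a rough sketch of the "any programming language" step, here is one way to build that <add> message in Python using only the standard library. The function name tweets_to_add_xml and the sample dict are made up for illustration, and the field names are assumed to match whatever your Solr schema defines. It only builds the XML string; sending it to Solr is left out.

```python
import xml.etree.ElementTree as ET


def tweets_to_add_xml(tweets):
    """Build a Solr <add> update message from an iterable of tweet dicts.

    Each dict maps a field name (assumed to exist in your schema) to a value.
    """
    add = ET.Element("add")
    for tweet in tweets:
        doc = ET.SubElement(add, "doc")
        for name, value in tweet.items():
            field = ET.SubElement(doc, "field", name=name)
            # ElementTree escapes &, < and > in the text on serialization.
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sample data mirroring the document above.
    sample = [
        {"tweetId": "ABCD1234", "tweet": "I bought some apples", "user": "JohnnyBoy"},
    ]
    print(tweets_to_add_xml(sample))
```

Building the message with an XML library rather than string concatenation means characters such as & and < in the tweet text get escaped for you, which matters with free-form input like tweets.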


David

On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:

Hi all.

I am in the process of setting up Solr for my application,
which is full text search on a bunch of tweets from twitter.

I am afraid I am missing something.
From the books I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).
A section named "Indexing documents with Solr Cell" seems to address my problem,
but also shows that before getting to Solr, I might need to use
another Apache tool called Tika.

Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I also need to put Tika in my pipeline?

Best regards,
Giovanni Gherdovich
