I've been asked to look at the Enron e-mail corpus
(http://www.cs.cmu.edu/~enron/) and I've decided to use Solr as a means to
analyse it. 

So I have a few questions...

First off, how can I convert the flat file text below:


Message-ID: <[EMAIL PROTECTED]>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/[EMAIL PROTECTED]>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast




to XML to input into Solr.
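
One possible approach, sketched in Python under a few assumptions: each
message is parsed with the standard-library email module and written out as
a <doc> inside Solr's <add> update message. The Solr field names (id,
date_raw, from, to, subject, folder, body) are hypothetical and would have
to match whatever schema.xml defines, and the addresses in SAMPLE are
placeholders, not values from the corpus.

```python
from email import message_from_string
from xml.etree import ElementTree as ET

# Hypothetical mapping from e-mail headers to Solr field names;
# these names are assumptions and must match the fields in schema.xml.
HEADER_TO_FIELD = {
    "Message-ID": "id",
    "Date": "date_raw",
    "From": "from",
    "To": "to",
    "Subject": "subject",
    "X-Folder": "folder",
}


def email_to_solr_doc(raw_text):
    """Turn one raw e-mail into a Solr <doc> element."""
    msg = message_from_string(raw_text)
    doc = ET.Element("doc")
    for header, field_name in HEADER_TO_FIELD.items():
        value = (msg.get(header) or "").strip()
        if value:  # skip absent or empty headers, such as a blank Subject
            field = ET.SubElement(doc, "field", name=field_name)
            field.text = value
    # Assumes a single-part text message, as in the sample above.
    body = ET.SubElement(doc, "field", name="body")
    body.text = msg.get_payload().strip()
    return doc


def emails_to_solr_xml(raw_emails):
    """Wrap several e-mails in a single <add> update message."""
    add = ET.Element("add")
    for raw in raw_emails:
        add.append(email_to_solr_doc(raw))
    # ElementTree escapes characters such as < and > in the field text.
    return ET.tostring(add, encoding="unicode")


# Placeholder message in the same layout as the corpus files.
SAMPLE = (
    "Message-ID: <msg-1@example.com>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: \n"
    "X-Folder: \\Sent Mail\n"
    "\n"
    "Here is our forecast\n"
)

if __name__ == "__main__":
    print(emails_to_solr_xml([SAMPLE]))
```

The Date header is kept as a raw string here. A Solr date field expects an
ISO 8601 timestamp, so it would need converting before being indexed as a
date.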

Secondly, I'd like to search for particular characteristics in the e-mails
and sort the messages into groups based on the results. For instance, I'd
look for characteristics that suggest an e-mail concerns confidential
company information.

How easy is it to write custom searches (based on semantics, word distances,
etc.) and use the results as output for further processing?
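
For the word-distance part, the Lucene query syntax used by Solr has a
proximity operator, "term1 term2"~N, which matches when the terms occur
within N positions of each other. A minimal sketch of building such a query
URL in Python follows. The Solr address and the field name body are
assumptions for illustration only.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust to the real install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def proximity_query(term_a, term_b, max_distance, field="body"):
    """Build a Lucene proximity query such as body:"a b"~5."""
    return f'{field}:"{term_a} {term_b}"~{max_distance}'


def build_search_url(query, rows=10):
    """URL-encode the query and request a JSON response from Solr."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    query = proximity_query("confidential", "information", 5)
    print(build_search_url(query))
```

Fetching that URL returns the matching documents, which a script could then
sort into groups.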


I'm a complete newbie so any help is appreciated! I hope I've come to the
right place.

Thanks. :-)
-- 
View this message in context: 
http://www.nabble.com/Large-Corpus-XML-Conversion--tp20389947p20389947.html
Sent from the Solr - User mailing list archive at Nabble.com.
