I've been asked to look at the Enron e-mail corpus (http://www.cs.cmu.edu/~enron/) and I've decided to use Solr as a means to analyse it.
So I have a few questions...

First off, how can I convert a flat text file like the one below:

Message-ID: <[EMAIL PROTECTED]>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/[EMAIL PROTECTED]>
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

into XML that I can feed into Solr?

Secondly, I'm looking into searching for particular things in the e-mails and sorting them into groups based on the results. For instance, characteristics of an e-mail that suggest it concerns confidential company information. How easy is it to make custom searches (based on semantics, word distances, etc.) and use the results as an output?

I'm a complete newbie, so any help is appreciated! I hope I've come to the right place.

Thanks. :-)

--
View this message in context: http://www.nabble.com/Large-Corpus-XML-Conversion--tp20389947p20389947.html
Sent from the Solr - User mailing list archive at Nabble.com.
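
For the first question, here is a minimal sketch of one possible conversion, not a definitive implementation. It uses Python's standard `email` and `xml.etree.ElementTree` modules to turn raw messages into Solr's `<add><doc><field name="...">` update XML. The field names (`id`, `date`, `from`, `to`, `subject`, `folder`, `body`) and the helper names are assumptions made up for illustration; they would have to match whatever is defined in your own Solr schema. It also assumes each message is plain, non-multipart text, as in the sample above.

```python
import email
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical header -> Solr field mapping; adjust to your own schema.
HEADER_TO_FIELD = {
    "Message-ID": "id",
    "From": "from",
    "To": "to",
    "Subject": "subject",
    "X-Folder": "folder",
}


def message_to_doc(raw_text):
    """Build one Solr <doc> element from the text of one raw e-mail."""
    msg = email.message_from_string(raw_text)
    doc = ET.Element("doc")

    for header, field_name in HEADER_TO_FIELD.items():
        value = msg.get(header)
        if value and value.strip():  # skip missing or empty headers
            ET.SubElement(doc, "field", name=field_name).text = value.strip()

    # Solr date fields expect ISO 8601 in UTC, e.g. 2001-05-14T23:39:00Z.
    if msg.get("Date"):
        sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
        ET.SubElement(doc, "field", name="date").text = sent.strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )

    # Assumption: plain-text, non-multipart messages only.
    if not msg.is_multipart():
        ET.SubElement(doc, "field", name="body").text = msg.get_payload().strip()

    return doc


def messages_to_add_xml(raw_messages):
    """Wrap several messages in a single <add> update document."""
    add = ET.Element("add")
    for raw in raw_messages:
        add.append(message_to_doc(raw))
    # ElementTree escapes characters such as '<' and '&' in the text for us.
    return ET.tostring(add, encoding="unicode")
```

The resulting string could then be saved to a file and posted to Solr's update handler. For a corpus of this size it would probably make sense to write the documents out in batches of a few thousand rather than as one huge file.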