I'm working on the Enron dataset, and I've already de-duplicated the
collection and applied a spam filter. The remaining e-mails have been
parsed to XML, with each header field (To, From, Date, etc.) separated
out and one large field (called Content) holding the rest of the e-mail.

So yes, to answer your question. Bear in mind, though, that this still
leaves around 240,000 files to process.

I have no idea about Solr analyzers/search components, but my theory was
that I'd need an analyzer to keep 'banner-like' content from being indexed
and a search component to identify 'corporate-like' information in the
content of the e-mails.
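
For what it's worth, the kind of check I have in mind could probably be run
as a pre-processing step before the XML is posted to Solr. Below is a rough
sketch only: the cue words, the threshold and the function name split_banner
are placeholders I made up for illustration, not anything from Solr, and the
trimming is deliberately crude.

```python
import re

# Hypothetical cue words and thresholds; placeholders, not tuned on the corpus.
BANNER_CUES = {"privileged", "confidential", "corporate", "intended",
               "recipient", "disclaimer"}
WINDOW = 100     # window size in words
THRESHOLD = 3    # assumed minimum cue hits for a window to count as a banner


def split_banner(content, window=WINDOW, threshold=THRESHOLD):
    """Return (text_to_index, banner_text) for one e-mail body."""
    words = content.split()
    # Strip punctuation and case only for matching; keep original words for output.
    is_cue = [re.sub(r"\W+", "", w).lower() in BANNER_CUES for w in words]

    # Slide the window over the text and remember the densest position.
    best_start, best_hits = 0, 0
    for start in range(max(0, len(words) - window) + 1):
        hits = sum(is_cue[start:start + window])
        if hits > best_hits:
            best_start, best_hits = start, hits

    if best_hits < threshold:
        return " ".join(words), ""

    # Crude trim: cut from the first to the last cue word inside that window.
    window_end = min(len(words), best_start + window)
    cue_positions = [i for i in range(best_start, window_end) if is_cue[i]]
    first, last = cue_positions[0], cue_positions[-1]
    banner = " ".join(words[first:last + 1])
    to_index = " ".join(words[:first] + words[last + 1:])
    return to_index, banner
```

The first value would go into the indexed Content field, and the untouched
original could be kept in a separate stored field so that searches still
return the full e-mail. The same counting over the remaining text, with a
different cue list, might serve to flag 'corporate-like' content.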

What is a business logic solution, and how would that work?


Thanks.



zayhen wrote:
> 
> I would go for a business logic solution and not a Solr customization in
> this case, as you need to filter information that you actually would like
> to see in different fields on your index.
> 
> Did you already try to split the email into several fields like subject,
> from, to, content, signature, etc.?
> 
> 
> 2009/2/16 Johnny X <jonathanwel...@gmail.com>
> 
>>
>> Hi there,
>>
>>
>> I was told before that I'd need to create a custom search component to do
>> what I want to do, but I'm thinking it might actually be a custom
>> analyzer.
>>
>> Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
>> field which is parsed as 'text'.
>>
>> I want to ignore certain elements of the e-mail (i.e. corporate banners),
>> but also identify the actual content of those e-mails including corporate
>> information.
>>
>> To identify the banners I need something a little more developed than a
>> stop word list. I need to evaluate the frequency of certain words around
>> words like 'privileged' and 'corporate' within a word window of about
>> 100ish words to determine whether they're banners and then remove them
>> from being indexed.
>>
>> I need to do the opposite during the same time to identify, in a similar
>> manner, which e-mails include corporate information in their actual
>> content.
>>
>> I suppose if I'm doing this I don't want what's processed to be indexed
>> as what's returned in a search, because then presumably it won't be the
>> full e-mail, so do I need to store some kind of copy field that keeps
>> the full e-mail and is fully indexed to be returned instead?
>>
>> Can what I'm suggesting be done and can anyone direct me to a guide?
>>
>>
>> On another note, is there an easy way to destroy an index...any custom
>> code?
>>
>>
>> Thanks for any help!
>>
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Alexander Ramos Jardim
> 
> 
> -----
> RPG da Ilha 
> 

-- 
View this message in context: 
http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html
Sent from the Solr - User mailing list archive at Nabble.com.
