I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. The remaining e-mails have been parsed to XML, with each header field (To, From, Date, etc.) separated out and one large field, called Content, holding the rest of the e-mail.
So yes, to answer your question, the e-mails are already split into fields. Bear in mind, though, that this still leaves roughly 240,000 files to process. I don't know much about Solr analyzers or search components, but my theory was that I'd need an analyzer to keep 'banner-like' content out of the index and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logic solution, and how would it work? Thanks.

zayhen wrote:
>
> I would go for a business logic solution and not a Solr customization in
> this case, as you need to filter information that you actually would like
> to see in different fields on your index.
>
> Did you already try to split the email into several fields like subject,
> from, to, content, signature, etc.?
>
> 2009/2/16 Johnny X <jonathanwel...@gmail.com>
>
>> Hi there,
>>
>> I was told before that I'd need to create a custom search component to do
>> what I want to do, but I'm thinking it might actually be a custom
>> analyzer.
>>
>> Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
>> field which is parsed as 'text'.
>>
>> I want to ignore certain elements of the e-mail (i.e. corporate banners),
>> but also identify the actual content of those e-mails including corporate
>> information.
>>
>> To identify the banners I need something a little more developed than a
>> stop word list. I need to evaluate the frequency of certain words around
>> words like 'privileged' and 'corporate' within a word window of about
>> 100ish words to determine whether they're banners and then remove them
>> from being indexed.
>>
>> I need to do the opposite during the same time to identify, in a similar
>> manner, which e-mails include corporate information in their actual
>> content.
>>
>> I suppose if I'm doing this I don't want what's processed to be indexed
>> as what's returned in a search, because then presumably it won't be the
>> full e-mail, so do I need to store some kind of copy field that keeps the
>> full e-mail and is fully indexed to be returned instead?
>>
>> Can what I'm suggesting be done and can anyone direct me to a guide?
>>
>> On another note, is there an easy way to destroy an index...any custom
>> code?
>>
>> Thanks for any help!
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Alexander Ramos Jardim
>
> -----
> RPG da Ilha

--
View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html
Sent from the Solr - User mailing list archive at Nabble.com.
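
To make the banner idea from the thread concrete, here is a rough sketch of how the word-window check might look if it were done as a pre-indexing step outside Solr rather than as a custom analyzer. It is Python purely for illustration. The trigger words come from the original question; the cue-word list, the threshold of 3 hits, and the function names (tokenize, banner_spans, strip_banners) are all assumptions made up for this example, and none of it is a Solr API.

```python
import re

# Trigger words named in the original question.
TRIGGERS = {"privileged", "corporate"}
# Hypothetical cue words; a real list would need tuning against the corpus.
CUES = {"confidential", "intended", "recipient", "disclaimer", "unauthorized"}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def banner_spans(tokens, window=100, min_hits=3):
    """Return (start, end) token spans that look like banners.

    For every trigger word, count cue words inside a window of roughly
    `window` tokens centred on it. If the count reaches `min_hits`, the span
    from the first to the last trigger/cue word in that window is reported
    (end is exclusive).
    """
    half = window // 2
    spans = []
    for i, tok in enumerate(tokens):
        if tok not in TRIGGERS:
            continue
        lo = max(0, i - half)
        hi = min(len(tokens), i + half + 1)
        marked = [j for j in range(lo, hi)
                  if tokens[j] in CUES or tokens[j] in TRIGGERS]
        hits = sum(1 for j in marked if tokens[j] in CUES)
        if hits >= min_hits:
            spans.append((marked[0], marked[-1] + 1))
    return spans


def strip_banners(text, window=100, min_hits=3):
    """Return the tokens of `text`, minus banner-like spans, joined by spaces."""
    tokens = tokenize(text)
    drop = set()
    for start, end in banner_spans(tokens, window, min_hits):
        drop.update(range(start, end))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in drop)


if __name__ == "__main__":
    mail = ("Please review the attached gas contract before Friday. "
            "This message is privileged and confidential and intended "
            "only for the named recipient.")
    print(strip_banners(mail))
```

The output is lowercased and stripped of punctuation, so it would only be suitable for an indexed, non-stored field; the untouched Content would presumably still need to be kept in a stored field so that searches return the full e-mail. The span trimming is deliberately crude and can leave a few words from the edge of a banner behind.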