I just checked the popular search services, and it seems that neither lucidimagination search nor search-lucene supports this:
http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization
http://www.search-lucene.com/m?id=510143ac0608042241k49f4afe7wcd25df3fbacc7...@mail.gmail.com||mailman
Markmail does not support this either:
http://markmail.org/message/papbjx3aoz3uvbhh

Hmm... I think it would be useful to extract just the *NEW* content, without all the quoted text, because the quoted text influences Lucene scoring.

Regards,
Lukas

On Mon, Mar 8, 2010 at 3:55 PM, Lukáš Vlček <lukas.vl...@gmail.com> wrote:
> Hi,
>
> is anybody willing to share their experience of extracting content from
> mailing list archives in order to have it indexed by Lucene or Solr?
>
> Imagine that we have access to the archive of some mailing list (e.g.
> http://www.mail-archive.com/mailman-users%40python.org/) and we would like
> to index individual emails. Is there an easy way to extract just the
> text content produced by the sender of each individual email? I am
> interested in the content generated by a particular sender, omitting the
> original quoted text. We can either access individual emails via the web
> or download the monthly archive in plain text format (but the content of
> individual emails depends on the email client of the author, i.e. plain
> text, html, html mixed with plain text in <table> ... etc... it is very
> messy).
>
> I would prefer information about mailing lists managed by mailman, but I
> don't want to limit the scope of this question, so any general ideas are
> welcome.
>
> Regards,
> Lukas
>
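
To illustrate what I mean by extracting only the new content, here is a minimal sketch in Python. It assumes plain-text bodies where quoted lines start with ">" and the attribution line looks like "On <date>, <name> wrote:". The function name extract_new_content and both patterns are my own assumptions for illustration, not something mailman or any of the archives above provide.

```python
import re

# Assumed shape of an attribution line, e.g.
# "On Mon, Mar 8, 2010 at 3:55 PM, Someone <addr> wrote:"
# Real mail clients vary a lot; this pattern is only a heuristic.
ATTRIBUTION = re.compile(r"^On\s.+\swrote:\s*$")

# A quoted line starts with ">" (optionally preceded by whitespace).
QUOTED = re.compile(r"^\s*>")


def extract_new_content(body: str) -> str:
    """Return the body without quoted lines and attribution lines."""
    kept = []
    for line in body.splitlines():
        if QUOTED.match(line) or ATTRIBUTION.match(line):
            continue  # drop text that belongs to the earlier message
        kept.append(line.rstrip())
    text = "\n".join(kept)
    # Collapse the runs of blank lines left behind by removed blocks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = (
        "Markmail does not support this either.\n"
        "\n"
        "Regards,\n"
        "Lukas\n"
        "\n"
        "On Mon, Mar 8, 2010 at 3:55 PM, Someone <a@example.org> wrote:\n"
        "> Hi,\n"
        ">\n"
        "> quoted text\n"
    )
    print(extract_new_content(sample))
```

This would not cover the messy cases from my original question: HTML mail, attribution lines that a client wraps over two lines, or clients that quote without a ">" prefix. Those would need extra handling before the text goes into the Lucene index.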