Hi Eric, Ioan, Oleg and others,

as offered in July:
> I would also like to add more test cases and especially include some
> dummy mboxes. And as mentioned I'd like to check the iterator against
> all my Thunderbird mboxes to check
> whether it will successfully parse them all.

I started doing this based on the improvements that you kindly checked in in the meantime. So I am working with 0.8.0-SNAPSHOT at this time.
I intend to run the iterator against some 1/4 million emails in some 850 mailboxes. I got as far as about message 400 with 0.7.2. With 0.8.0-SNAPSHOT the library chokes at about message 4000, which is from the Apple Store! It contains:

    <[email protected]>
    Content-Type: TEXT/HTML; CHARSET=None
    Content-Transfer-Encoding: QUOTED-PRINTABLE

And I ran into bug
https://issues.apache.org/jira/browse/MIME4J-218

I tried:

    /**
     * Lenient BodyFactory that fixes
     * https://issues.apache.org/jira/browse/MIME4J-218 won't fix behaviour
     *
     * @author wf
     *
     */
    public static class LenientBodyFactory extends BasicBodyFactory {

      @Override
      public Charset resolveCharset(final String mimeCharset) throws UnsupportedEncodingException {
        Charset result=Charset.defaultCharset();
        try {
          result=super.resolveCharset(mimeCharset);
        } catch (UnsupportedEncodingException ex) {
          // ignore
        }
        return result;
      }
    }

Which didn't work since resolveCharset is static private ... :-(

I proposed the following fix for
dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java:

    public static boolean lenient=true;

    /**
     * select the Charset for the given mimeCharset string
     *
     * if you need support for non standard or invalid mimeCharset specifications
     * you might want to create your own derived BodyFactory extending BasicBodyFactory and
     * overriding this method as suggested by:
     * https://issues.apache.org/jira/browse/MIME4J-218
     *
     * the default behaviour is lenient, invalid mimeCharset specs will return the defaultCharset
     *
     * @param mimeCharset - the string specification for a charset e.g. "UTF-8"
     * @throws UnsupportedEncodingException if the mimeCharset is invalid
     */
    protected Charset resolveCharset(final String mimeCharset) throws UnsupportedEncodingException {
      Charset result=null;
      if (lenient) {
        result=Charset.defaultCharset();
      }
      if (mimeCharset !=null) {
        try {
          result= Charset.forName(mimeCharset);
        } catch (UnsupportedCharsetException ex) {
          if (!lenient)
            throw new UnsupportedEncodingException(mimeCharset);
        }
      }
      return result;
    }

Now I was hoping to be able to test this fix. I assume I have to add some test message to:

    core: src/test/resources/testmsgs

But to really check the new behaviour there would have to be three different tests:

1. check invalid mimeCharset in lenient mode - will work with default Charset
2. check invalid mimeCharset in non-lenient mode - will throw exception
3. check invalid mimeCharset in non-lenient mode with overridden resolveCharset - will work with chosen mapped Charset.

Please let me know how I can add these tests and how to get a proper patch set going. I don't work much with Subversion these days - I prefer to use git.

Cheers
  Wolfgang

On 10.08.14 at 10:33, Stan Ioan Eugen wrote:
> Hello Wolfgang,
>
> Sorry for my late reply. I've created a Jira ticket to track this
> issue. As Eric suggested, it's the right way to get code into the
> project.
> I've looked over the code and it looks good in general. I would keep
> both variants of the regular expression to match FROM lines, with a
> good javadoc, so users can use any of them in their code. I would
> also move the 'mbox != null' check inside the constructor - this way
> we make sure we don't create an object in an inconsistent state.
>
> I will be more than happy to push the patch upstream once we have some
> tests for the new behavior. Are you interested in providing the tests?
>
> Please use the issue for patch submission and relevant comments.
> https://issues.apache.org/jira/browse/MIME4J-242
>
> Thanks,
>
>
> 2014-08-03 10:52 GMT+03:00 Eric Charles <[email protected]>:
>> Could you open a JIRA issue on https://issues.apache.org/jira/browse/MIME4J
>> and upload your patch there? Thx.
>>
>> On 07/23/2014 09:57 AM, Wolfgang Fahl wrote:
>>> Hi Ioan Eugen,
>>>
>>> please find attached a patch.
>>>
>>> it uses the following fromline pattern:
>>>     static final String DEFAULT = "^From \\S+.*\\d{4}$";
>>> so that it matches more lines.
>>> 1. From [email protected] Fri Sep 09 14:04:52 2011
>>> 2. From MAILER-DAEMON Wed Oct 05 21:54:09 2011
>>> 3. From - Wed Apr 02 06:51:08 2014
>>>
>>> so looking for an "@" sign is not enforced any more.
>>>
>>> The patch fixes a typo:
>>> -    private Matcher fromLineMathcer;
>>> +    private Matcher fromLineMatcher;
>>>
>>> in many places of the source code.
>>>
>>> It adds a reference to the original mbox File so that the error message:
>>> +            if (mbox!=null)
>>> +                path=mbox.getPath();
>>> +            throw new IllegalArgumentException("File "+path+" does not
>>> contain From_ lines that match the pattern
>>> '"+MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");
>>>
>>> can be improved.
>>>
>>> Who is going to check this patch and what needs to be done to get it
>>> into the official repo?
>>> I would also like to add more test cases and especially include some
>>> dummy mboxes. And as mentioned I'd like to check the iterator against
>>> all my Thunderbird mboxes to check
>>> whether it will successfully parse them all. Also I am offering to write
>>> a few "tutorial lines". Where would I have to put these?
>>>
>>> Cheers
>>>   Wolfgang
>>>
>>> On 22.07.14 22:23, Ioan Eugen Stan wrote:
>>>> Hello Wolfgang,
>>>>
>>>> I developed MailboxIterator. It's nice to see that it's helpful :)
>>>>
>>>> You get that error because MboxIterator does not know how to split the
>>>> messages. Messages in an mbox file are separated via lines that start
>>>> with 'From '. They are called (by me at least) 'From lines' :) .
>>>> One problem with the mbox format is that it's a bit 'free-form' in the
>>>> sense that developers abused it and we have some variants [1].
>>>>
>>>> One thing that you could try is to supply a different From line
>>>> regular expression to MboxIterator via the regexpPattern argument. It will
>>>> split messages based on this new value.
>>>>
>>>> [1] http://wiki2.dovecot.org/MailboxFormat/mbox
>>>>
>>>> Good luck and please post your results.
>>>>
>>>> Regards,
>>>>
>>>> On Fri, Jul 18, 2014 at 12:53 PM, Wolfgang Fahl <[email protected]> wrote:
>>>>> Dear mime4j developers,
>>>>>
>>>>> for one of my projects I have been using mime4j successfully to import
>>>>> e-mail into our CRM database for some two years now.
>>>>> Currently I am trying to add a feature which would allow reading Mozilla
>>>>> Thunderbird Mailbox content.
>>>>> As of mime4j 0.8 there seems to be a MboxIterator which could do that.
>>>>> Since I didn't find any publicly available source repository which I
>>>>> could use to access the 0.8-SNAPSHOT I have copied
>>>>> the three source files:
>>>>> * CharBufferWrapper.java
>>>>> * FromLinePatterns.java
>>>>> * MboxIterator.java
>>>>>
>>>>> into my source tree and I am using these together with the following
>>>>> maven dependency:
>>>>>
>>>>>     <!-- EMail handling -->
>>>>>     <dependency>
>>>>>       <groupId>org.apache.james</groupId>
>>>>>       <artifactId>apache-mime4j-core</artifactId>
>>>>>       <version>0.7.2</version>
>>>>>     </dependency>
>>>>>     <dependency>
>>>>>       <groupId>org.apache.james</groupId>
>>>>>       <artifactId>apache-mime4j-dom</artifactId>
>>>>>       <version>0.7.2</version>
>>>>>     </dependency>
>>>>>
>>>>> The iterator works somewhat o.k. on some of the Thunderbird mailbox
>>>>> files and loops through the mails in them correctly.
>>>>> The mails can then not be directly parsed with mime4j - there is one
>>>>> newline at the beginning which spoils the show.
>>>>> After working around this it's working as expected in some cases. In other
>>>>> cases there is an error:
>>>>>
>>>>> java.lang.IllegalArgumentException: File does not contain From_ lines! Maybe not be a vaild Mbox.
>>>>>     at org.apache.james.mime4j.mboxiterator.MboxIterator.initMboxIterator(MboxIterator.java:85)
>>>>>     at org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:75)
>>>>>     at org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:62)
>>>>>     at org.apache.james.mime4j.mboxiterator.MboxIterator$Builder.build(MboxIterator.java:241)
>>>>>     at com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:386)
>>>>>     at com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:261)
>>>>>     at com.bitplan.clientutils.rest.TestMailAccess.testMailById(TestMailAccess.java:77)
>>>>>
>>>>> By the way - there is a typo in the above error message: "vaild" should
>>>>> be "valid".
>>>>>
>>>>> The error is something I'd like to fix or work around.
>>>>>
>>>>> I have two big user accounts with several hundred mailbox files and some
>>>>> 300,000 mails from the last 15 years which I'd like
>>>>> to use as a test case against which to run the mime4j implementation.
>>>>>
>>>>> Would you please supply me with some pointers on where I can get the necessary
>>>>> source code and how I could supply patches and
>>>>> test cases for the project?
>>>>>
>>>>> Also it would be good to know whether others would be interested in the
>>>>> Thunderbird Mailbox reading capability.
>>>>>
>>>>>
>>>>> Cheers
>>>>>   Wolfgang
>>>>>
>>>>> --
>>>>>
>>>>> BITPlan - smart solutions
>>>>> Wolfgang Fahl
>>>>> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
>>>>> Tel. +49 2154 811-480, Fax +49 2154 811-481
>>>>> Web: http://www.bitplan.de
>>>>> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
>>>>> Geschäftsführer: Wolfgang Fahl
>>>>>
>>>>
>
>

--
BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
Geschäftsführer: Wolfgang Fahl
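
For reference, the three charset test cases listed in this message could be sketched roughly as follows. This is a standalone illustration and not mime4j code: CharsetResolver and MappingCharsetResolver are hypothetical stand-ins for BasicBodyFactory and a derived factory, the method is public and the lenient flag is an instance field purely to keep the demo small, and mapping "None" to ISO-8859-1 is an arbitrary choice. Unlike the proposed patch, the sketch also catches IllegalCharsetNameException, which Charset.forName throws for syntactically illegal names.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical stand-in for the proposed resolveCharset logic; not part of mime4j. */
class CharsetResolver {
    private final boolean lenient;

    CharsetResolver(boolean lenient) {
        this.lenient = lenient;
    }

    /** Resolves a MIME charset name; lenient mode falls back to the platform default. */
    public Charset resolveCharset(String mimeCharset) throws UnsupportedEncodingException {
        if (mimeCharset == null) {
            // Same outcome as the proposed patch: default when lenient, null otherwise.
            return lenient ? Charset.defaultCharset() : null;
        }
        try {
            return Charset.forName(mimeCharset);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException ex) {
            if (!lenient) {
                throw new UnsupportedEncodingException(mimeCharset);
            }
            return Charset.defaultCharset();
        }
    }
}

/** Case 3: non-lenient resolver with an overridden mapping for a known bad name. */
class MappingCharsetResolver extends CharsetResolver {
    MappingCharsetResolver() {
        super(false);
    }

    @Override
    public Charset resolveCharset(String mimeCharset) throws UnsupportedEncodingException {
        if ("none".equalsIgnoreCase(mimeCharset)) {
            return StandardCharsets.ISO_8859_1; // arbitrary mapping, for illustration only
        }
        return super.resolveCharset(mimeCharset);
    }
}

class CharsetResolverDemo {
    public static void main(String[] args) throws Exception {
        // Case 1: lenient mode, invalid name, falls back to the default charset.
        System.out.println("lenient: " + new CharsetResolver(true).resolveCharset("None"));

        // Case 2: non-lenient mode, invalid name, throws.
        try {
            new CharsetResolver(false).resolveCharset("None");
        } catch (UnsupportedEncodingException ex) {
            System.out.println("strict: UnsupportedEncodingException for " + ex.getMessage());
        }

        // Case 3: non-lenient mode with an overridden mapping.
        System.out.println("mapped: " + new MappingCharsetResolver().resolveCharset("None"));
    }
}
```

In a real test these three cases would feed a message with the CHARSET=None header through the parser; the sketch only exercises the resolution step in isolation.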

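
The relaxed From line pattern quoted in the thread can likewise be tried in isolation. In this minimal sketch only the regular expression is taken from the patch description; FromLineDemo and isFromLine are hypothetical names, and the sample lines are the second and third examples from the quoted list plus an ordinary "From:" header as a negative case.

```java
import java.util.regex.Pattern;

/** Hypothetical demo class; only the regular expression comes from the thread. */
class FromLineDemo {
    // "From ", a non-blank sender token, anything, and a four-digit year at the end.
    static final Pattern DEFAULT = Pattern.compile("^From \\S+.*\\d{4}$");

    /** Returns true if the whole line looks like an mbox From_ separator line. */
    static boolean isFromLine(String line) {
        return DEFAULT.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "From MAILER-DAEMON Wed Oct 05 21:54:09 2011",
            "From - Wed Apr 02 06:51:08 2014",
            "From: someone" // a header line, not a separator
        };
        for (String line : samples) {
            System.out.println(isFromLine(line) + "  " + line);
        }
    }
}
```

Because the pattern no longer requires an "@" sign, separator lines such as the Thunderbird style "From - ..." are accepted, while a "From:" header is still rejected since the fifth character must be a space.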