Good afternoon,
I have a data parsing challenge related to our use of mime4j. We encounter mbox
data that is unconventional in structure, but that we are required to process
nonetheless. The particular mbox files we are having issues with are very large
(some over 5 MB) and are headers-only. Mime4j is likely parsing the files
properly, but the time it takes is prohibitively long.
We use the MimeTokenStream parser. We don't believe this can be addressed in
configuration (i.e. via MimeConfig).
Ideally, we would be able to specify that, if the number of headers processed
exceeds maxHeaders, the parser stops parsing, resets the stream pointer to the
beginning of the input stream, and just outputs everything as one giant header
(or body?) or, probably more realistically, emits the output in chunks of
whatever size is reasonable for IO of this nature.
Otherwise, I guess it's a custom coding solution? It appears that this would
involve a custom parser that extends or borrows from MimeStreamParser or
MimeTokenStream, or both. For instance, below is the critical area of code
from MimeStreamParser where we want to avoid getting stuck processing these
5 MB header-only files.
Grateful for any response. Thanks!
while (true) {
    EntityState state = this.mimeTokenStream.getState();
    switch (state) {
        case T_BODY:
            BodyDescriptor desc =
                    this.mimeTokenStream.getBodyDescriptor();
            InputStream bodyContent;
            if (this.contentDecoding) {
                bodyContent =
                        this.mimeTokenStream.getDecodedInputStream();
            } else {
                bodyContent = this.mimeTokenStream.getInputStream();
            }
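To make the "stop and reset once maxHeaders is exceeded" idea more concrete, here is a rough sketch of a pre-check that could run before the stream is handed to mime4j at all. It uses only the JDK and does not touch any mime4j API. The class name HeaderLimitProbe, the method exceedsHeaderLimit, and the readLimit parameter are all made up for illustration. It assumes the stream supports mark/reset (for example, a BufferedInputStream), that lines end in LF or CRLF, and that folded continuation lines start with a space or tab. It only looks at the first header block and would count an mbox "From " separator line as a header, so it is an approximation at best.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of mime4j: a cheap pre-scan of the header block.
final class HeaderLimitProbe {

    private HeaderLimitProbe() {}

    /**
     * Returns true if more than maxHeaders header lines are seen before the
     * first blank line, looking at no more than readLimit bytes. The stream is
     * always reset to where it started, so the caller can still parse it or
     * copy it out raw afterwards.
     */
    static boolean exceedsHeaderLimit(InputStream in, int maxHeaders, int readLimit)
            throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(readLimit);
        try {
            int headers = 0;
            int lineLength = 0;   // bytes on the current line, excluding CR/LF
            int firstByte = -1;   // first byte of the current line
            for (int read = 0; read < readLimit; read++) {
                int b = in.read();
                if (b == -1) {
                    break;        // end of input
                }
                if (b == '\r') {
                    continue;     // ignore CR; LF marks the end of a line
                }
                if (b != '\n') {
                    if (lineLength == 0) {
                        firstByte = b;
                    }
                    lineLength++;
                    continue;
                }
                if (lineLength == 0) {
                    return false; // blank line: header block ended under the limit
                }
                // Folded continuation lines belong to the previous header.
                boolean continuation = firstByte == ' ' || firstByte == '\t';
                if (!continuation && ++headers > maxHeaders) {
                    return true;
                }
                lineLength = 0;
            }
            return false;         // limit not exceeded within readLimit bytes
        } finally {
            in.reset();           // rewind to the marked position
        }
    }
}
```

The intended use would be to wrap the mbox input in a BufferedInputStream, call exceedsHeaderLimit(in, maxHeaders, readLimit), and then either copy the raw bytes out in fixed-size chunks (when it returns true) or pass the same, already rewound stream on to MimeTokenStream as usual. Whether something like this belongs in front of the parser or inside a custom one is exactly what we are unsure about.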