Hi Leif,

I've had the same problem. I tried with 4.2.0 as well, on both Fedora 17 and CentOS 6, using Java 6 and Java 7 (OpenJDK as well as Oracle/Sun). I could NEVER use example-DIH against a mailbox containing mails with attachments. Mails without attachments were indexed, even HTML ones, but as soon as a mail included at least 2 MIME parts (body + attachment), it disappeared from the indexing.

So I decided to put some traces in the code, and I found out that the trace "isMimeType #2" is never shown. After modifying the code, I'm sure that for every mail with an attachment that I send, the call part.getContent() returns null, hence the unexpected result.

#FILE: solr-4.2.0/solr/contrib/dataimporthandler-extras/src/java/org/apache/solr/handler/dataimport/MailEntityProcessor.java

    public void addPartToDocument(Part part, Map<String, Object> row,
        boolean outerMost) throws Exception {
      LOG.info("Inside addPartToDocument start");
      if (part instanceof Message) {
        LOG.info("Inside addPartToDocument.part instanceof message");
        addEnvelopToDocument(part, row);
      }

      String ct = part.getContentType();
      ContentType ctype = new ContentType(ct);
      if (part.isMimeType("multipart/*")) {
        LOG.info("Inside addPartToDocument.isMimeType #1 " + ct);
        if (part.getContent() != null) {
          Multipart mp = (Multipart) part.getContent();
          LOG.info("Inside addPartToDocument.isMimeType #2");
          int count = mp.getCount();
          LOG.info("Inside addPartToDocument.isMimeType #3 count is" + String.valueOf(count));
          if (part.isMimeType("multipart/alternative"))
            count = 1;
          for (int i = 0; i < count; i++) {
            LOG.info("Inside addPartToDocument.isMimeType.for()");
            addPartToDocument(mp.getBodyPart(i), row, false);
          }
        }
      } else if (part.isMimeType("message/rfc822")) {
        addPartToDocument((Part) part.getContent(), row, false);
      } else {
        LOG.info("Inside addPartToDocument.ELSE #1");
        String disp = part.getDisposition();
        if (!processAttachment
            || (disp != null && disp.equalsIgnoreCase(Part.ATTACHMENT)))
          return;
        LOG.info("Inside addPartToDocument.ELSE #2");
        InputStream is = part.getInputStream();
        String fileName = part.getFileName();
        Metadata md = new Metadata();
        md.set(HttpHeaders.CONTENT_TYPE, ctype.getBaseType().toLowerCase(Locale.ROOT));
        md.set(TikaMetadataKeys.RESOURCE_NAME_KEY, fileName);
        String content = tika.parseToString(is, md);
        LOG.info("Inside addPartToDocument.ELSE #3");
        if (disp != null && disp.equalsIgnoreCase(Part.ATTACHMENT)) {
          LOG.info("Inside addPartToDocument.ELSE #4aaa");
          if (row.get(ATTACHMENT) == null)
            row.put(ATTACHMENT, new ArrayList<String>());
          List<String> contents = (List<String>) row.get(ATTACHMENT);
          contents.add(content);
          row.put(ATTACHMENT, contents);
          if (row.get(ATTACHMENT_NAMES) == null)
            row.put(ATTACHMENT_NAMES, new ArrayList<String>());
          List<String> names = (List<String>) row.get(ATTACHMENT_NAMES);
          names.add(fileName);
          row.put(ATTACHMENT_NAMES, names);
        } else {
          LOG.info("Inside addPartToDocument.ELSE #4bis");
          if (row.get(CONTENT) == null)
            row.put(CONTENT, new ArrayList<String>());
          List<String> contents = (List<String>) row.get(CONTENT);
          contents.add(content);
          row.put(CONTENT, contents);
        }
      }
    }

My solrconfig is the same as the one included in the example-DIH folder.

I've found on Google that javamail + activation can cause problems if the versions bundled with the application don't match the ones that are now included in the JRE. I tried removing them, replacing them with newer versions, etc., but with no result.

I believe that the handling of multipart MIME lacks some error checking, and the problem is probably related to the content outside the MIME boundaries (in my example, the text "This is a multi-part message in MIME format.").

I really hope that some Solr developer can have a look; we cannot be the only ones having this problem, and I've spent almost twenty hours debugging this.
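
To make "content outside the MIME boundaries" concrete: as far as I understand RFC 2046, anything before the first boundary delimiter (the preamble) and after the closing delimiter (the epilogue) is supposed to be ignored by a parser. The following is only a rough sketch of that rule in plain Java. It is NOT the JavaMail parser, and the names MultipartSketch and splitParts are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a naive RFC 2046 style splitter, not JavaMail code. */
final class MultipartSketch {

    /**
     * Splits a multipart body into its raw parts (part headers + content,
     * left unparsed). Lines before the first delimiter (preamble) and after
     * the closing delimiter (epilogue) are dropped.
     */
    static List<String> splitParts(String body, String boundary) {
        String delimiter = "--" + boundary;       // starts each part
        String closeDelimiter = delimiter + "--"; // ends the last part
        List<String> parts = new ArrayList<String>();
        StringBuilder current = null;             // null while inside the preamble

        for (String line : body.split("\r?\n", -1)) {
            // Delimiter lines may carry trailing whitespace.
            String trimmed = line.replaceAll("[ \\t]+$", "");
            if (trimmed.equals(closeDelimiter)) {
                if (current != null) parts.add(current.toString());
                return parts;                     // everything after this is epilogue
            }
            if (trimmed.equals(delimiter)) {
                if (current != null) parts.add(current.toString());
                current = new StringBuilder();    // a new part begins
            } else if (current != null) {
                current.append(line).append('\n');
            }
            // else: still in the preamble, skip the line
        }
        // No closing delimiter found: keep whatever was collected.
        if (current != null) parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        String boundary = "------------020306090503030103060209";
        String body =
            "This is a multi-part message in MIME format.\n"
            + "--" + boundary + "\n"
            + "Content-Type: text/plain; charset=ISO-8859-1\n"
            + "\n"
            + "there it goes 2\n"
            + "--" + boundary + "\n"
            + "Content-Type: application/octet-stream\n"   // placeholder attachment
            + "Content-Disposition: attachment; filename=\"example.bin\"\n"
            + "\n"
            + "AAAA\n"
            + "--" + boundary + "--\n";
        List<String> parts = splitParts(body, boundary);
        System.out.println(parts.size()); // prints 2: the preamble is not a part
    }
}
```

With a mail shaped like my example, a splitter following this rule yields two parts (body and attachment) and silently drops the preamble line, so the preamble text alone should not be a reason to lose the whole message.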

Regards

PS: Example of a mail that doesn't get processed:

Return-Path: marcos.gar...@savoirfairelinux.com
Received: from mail.savoirfairelinux.com (LHLO mail.savoirfairelinux.com)
 (192.168.52.6) by mail.savoirfairelinux.com with LMTP;
 Mon, 18 Mar 2013 12:10:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
 by mail.savoirfairelinux.com (Postfix) with ESMTP id 39CA425819D
 for <solr.t...@savoirfairelinux.com>; Mon, 18 Mar 2013 12:10:22 -0400 (EDT)
X-Virus-Scanned: amavisd-new at mail.savoirfairelinux.com
X-Spam-Flag: NO
X-Spam-Score: -2.9
X-Spam-Level:
X-Spam-Status: No, score=-2.9 tagged_above=-10 required=4.4
 tests=[ALL_TRUSTED=-1, BAYES_00=-1.9] autolearn=ham
Received: from mail.savoirfairelinux.com ([127.0.0.1])
 by localhost (mail.savoirfairelinux.com [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id fxeONJBdw8JA for <solr.t...@savoirfairelinux.com>;
 Mon, 18 Mar 2013 12:10:21 -0400 (EDT)
Received: from [192.168.50.126] (mtl.savoirfairelinux.net [208.88.110.46])
 by mail.savoirfairelinux.com (Postfix) with ESMTPSA id D0BC025819B
 for <solr.t...@savoirfairelinux.com>; Mon, 18 Mar 2013 12:10:21 -0400 (EDT)
Message-ID: <51473c6d.8010...@savoirfairelinux.com>
Date: Mon, 18 Mar 2013 12:10:21 -0400
From: Marcos Garcia <marcos.gar...@savoirfairelinux.com>
Organization: SFL
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130311 Thunderbird/17.0.4
MIME-Version: 1.0
To: solr.t...@savoirfairelinux.com
Subject: normal mail 2
X-Opacus-Archived: none
Content-Type: multipart/mixed;
 boundary="------------020306090503030103060209"

This is a multi-part message in MIME format.
--------------020306090503030103060209
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

there it goes 2

-- 
Marcos Garcia
Consultant en logiciel libre
Savoir-faire Linux
http://www.savoirfairelinux.com
marcos.gar...@savoirfairelinux.com
Tel : (514) 276-5468 ext 137

--------------020306090503030103060209
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;
 name="videos a supprimer.xlsx"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="videos a supprimer.xlsx"

UEsDBBQABgAIAAAAIQBxDjkrcAEAAKAFAAATANsBW0NvbnRlbnRfVHlwZXNdLnhtbCCi1wEo
oAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADMlE1OwzAQhfdI3CHyFiVu
i4QQatoFP0uoBBzA2JPGqmNbHre0t2eS0ApQiFTSBZtEUTTvvflm7Ol8W5lkAwG1szkbZyOW
gJVOabvM2evLQ3rNEozCKmGchZztANl8dn42fdl5wISqLeasjNHfcI6yhEpg5jxY+lO4UIlI
n2HJvZArsQQ+GY2uuHQ2go1prDXYbPpEAYJWkCxEiI+iIh++NTySGrTPcUZ6LLltC2vvnAnv
jZYiUnK+seqHa+qKQktQTq4r8soasYtahf9qiHFnAAdboQ8gFJYAsTJZK7p3voNCrE1M7rdE
oIUewOBxrX3CzKiyaR9L7bHHoZ9dP5N3F1Zvzq1OTaWmk1VC233uriWg6S2C88hp1oMDQI1c
gUo9SUKIGg7MurxpAevemzEib16TwRm+r8ZBv49BR47Lf5Jj+Kn8Gw8sRQD1HAPdUic/rl+1
++Zy2E3pAhw/kP0Zrqs7NpI39+vsAwAA//8DAFBLAwQUAAYACAAAACEAtVUwI/UAAABMAgAA
CwDOAV9yZWxzLy5yZWxzIKLKASigAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

PS: Example of my trace output:

Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor getNextMail
INFO: Inside getNextMail before next()
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument start
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.part instanceof message
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.isMimeType #1 multipart/MIXED; boundary=------------020306090503030103060209
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor getNextMail
INFO: Inside getNextMail before next()
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument start
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.part instanceof message
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.ELSE #1
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.ELSE #2

Leif Hetlesæther <leif <at> gurusoft.no> writes: