Hi Leif

I've had the same problem. I tried with 4.2.0 as well, on both Fedora 17 and
CentOS 6, using Java 6 and Java 7 (OpenJDK as well as Oracle/Sun). I could
NEVER use example-DIH against a mailbox containing mails with attachments. Only
mails without attachments were indexed, even HTML ones; as soon as a mail had
at least 2 MIME parts (body + attachment), it disappeared from the indexing.

So I decided to put some traces in the code, and I found out that the trace
"isMimeType #2" is never shown. After modifying the code, I'm sure that for
every mail with an attachment that I send, the call "part.getContent()" returns
null, hence the unexpected result.

#FILE: solr-4.2.0/solr/contrib/dataimporthandler-extras/src/java/org/apache/solr/handler/dataimport/MailEntityProcessor.java

  public void addPartToDocument(Part part, Map<String, Object> row,
      boolean outerMost) throws Exception {
    LOG.info("Inside addPartToDocument start");
    if (part instanceof Message) {
      LOG.info("Inside addPartToDocument.part instanceof message");
      addEnvelopToDocument(part, row);
    }

    String ct = part.getContentType();
    ContentType ctype = new ContentType(ct);
    if (part.isMimeType("multipart/*")) {
      LOG.info("Inside addPartToDocument.isMimeType #1 "+ ct);
      if (part.getContent() != null){
        Multipart mp = (Multipart) part.getContent();
        LOG.info("Inside addPartToDocument.isMimeType #2");
        int count = mp.getCount();
        LOG.info("Inside addPartToDocument.isMimeType #3 count is"+String.valueOf(count));
        if (part.isMimeType("multipart/alternative"))
          count = 1;
        for (int i = 0; i < count; i++)
        {
          LOG.info("Inside addPartToDocument.isMimeType.for()");
          addPartToDocument(mp.getBodyPart(i), row, false);
        }
      }
    } else if (part.isMimeType("message/rfc822")) {
      addPartToDocument((Part) part.getContent(), row, false);
    } else {
      LOG.info("Inside addPartToDocument.ELSE #1");
      String disp = part.getDisposition();
      if (!processAttachment || (disp != null &&
          disp.equalsIgnoreCase(Part.ATTACHMENT))) return;
      LOG.info("Inside addPartToDocument.ELSE #2");
      InputStream is = part.getInputStream();
      String fileName = part.getFileName();
      Metadata md = new Metadata();
      md.set(HttpHeaders.CONTENT_TYPE,
          ctype.getBaseType().toLowerCase(Locale.ROOT));
      md.set(TikaMetadataKeys.RESOURCE_NAME_KEY, fileName);
      String content = tika.parseToString(is, md);
      LOG.info("Inside addPartToDocument.ELSE #3");
      if (disp != null && disp.equalsIgnoreCase(Part.ATTACHMENT)) {
        LOG.info("Inside addPartToDocument.ELSE #4aaa");
        if (row.get(ATTACHMENT) == null)
          row.put(ATTACHMENT, new ArrayList<String>());
        List<String> contents = (List<String>) row.get(ATTACHMENT);
        contents.add(content);
        row.put(ATTACHMENT, contents);
        if (row.get(ATTACHMENT_NAMES) == null)
          row.put(ATTACHMENT_NAMES, new ArrayList<String>());
        List<String> names = (List<String>) row.get(ATTACHMENT_NAMES);
        names.add(fileName);
        row.put(ATTACHMENT_NAMES, names);
      } else {
        LOG.info("Inside addPartToDocument.ELSE #4bis");
        if (row.get(CONTENT) == null)
          row.put(CONTENT, new ArrayList<String>());
        List<String> contents = (List<String>) row.get(CONTENT);
        contents.add(content);
        row.put(CONTENT, contents);
      }
    }
  }
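
One way to narrow this down is to log what part.getContent() actually returns before it is cast to Multipart. The JavaMail documentation for Part.getContent() states that an input stream is returned for content types unknown to the DataHandler system, so a raw stream (rather than null) would point at a missing content handler. A minimal sketch, assuming a hypothetical helper named ContentProbe that is not part of Solr or JavaMail:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical helper (not part of Solr or JavaMail): classifies the object
// returned by part.getContent() so a trace can show what really came back.
final class ContentProbe {
  private ContentProbe() {}

  static String describe(Object content) {
    if (content == null) {
      return "null";
    }
    if (content instanceof InputStream) {
      // Per the Part.getContent() documentation, a stream is the fallback
      // when the content type is unknown to the DataHandler system.
      return "raw stream: " + content.getClass().getName();
    }
    return "object: " + content.getClass().getName();
  }

  public static void main(String[] args) {
    System.out.println(describe(null));
    System.out.println(describe(new ByteArrayInputStream(new byte[0])));
    System.out.println(describe("plain text body"));
  }
}
```

In the traced method this could be called as LOG.info("content is " + ContentProbe.describe(part.getContent())); right after the "isMimeType #1" trace.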

My solrconfig is the same as the one included in the example-DIH folder.

I found through Google that javamail + activation can cause problems if the
versions bundled with the application don't match the ones that are now
included in the JRE. I tried removing them, replacing them with newer versions,
etc., but with no result.
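
To check the version-mismatch theory, it can help to see where each class is actually loaded from (a jar, or the JRE itself). A minimal sketch, assuming a hypothetical JarLocator helper that is not part of Solr; it only uses standard reflection calls:

```java
import java.security.CodeSource;

// Hypothetical diagnostic (not part of Solr): reports where a class was loaded from.
final class JarLocator {
  private JarLocator() {}

  static String locate(String className) {
    try {
      Class<?> cls = Class.forName(className);
      CodeSource src = cls.getProtectionDomain().getCodeSource();
      // Classes loaded by the bootstrap class loader have no code source.
      if (src == null || src.getLocation() == null) {
        return "JRE (bootstrap class loader)";
      }
      return src.getLocation().toString();
    } catch (ClassNotFoundException e) {
      return "not on classpath";
    }
  }

  public static void main(String[] args) {
    String[] names = {"javax.mail.Part", "javax.activation.DataHandler"};
    for (String name : names) {
      System.out.println(name + " -> " + locate(name));
    }
  }
}
```

To be meaningful, this would have to run inside the same web application (and so the same class loader) as the DataImportHandler.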

I believe that the handling of multipart MIME lacks some error checking, and
that the problem is probably related to the content outside the MIME boundaries
(in my example, the text "This is a multi-part message in MIME format.").
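
For reference, RFC 2046 requires implementations to ignore anything that appears before the first boundary delimiter line or after the last one, so a conforming parser should skip that preamble text. A simplified sketch of the rule, assuming a hypothetical MultipartSplitter class (this is not JavaMail's actual parser, and it does not decode headers or transfer encodings):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified splitter (not JavaMail's parser): shows that text
// outside the boundary delimiters is dropped.
final class MultipartSplitter {
  private MultipartSplitter() {}

  static List<String> split(String body, String boundary) {
    String delimiter = "--" + boundary;
    String closing = delimiter + "--";
    List<String> parts = new ArrayList<String>();
    StringBuilder current = null; // null while in the preamble or epilogue
    for (String rawLine : body.split("\r?\n", -1)) {
      // Trailing whitespace after a delimiter is tolerated.
      String line = rawLine.replaceAll("\\s+$", "");
      if (line.equals(closing)) {
        if (current != null) {
          parts.add(current.toString());
        }
        current = null;
        break; // everything after the closing delimiter is epilogue
      } else if (line.equals(delimiter)) {
        if (current != null) {
          parts.add(current.toString());
        }
        current = new StringBuilder();
      } else if (current != null) {
        current.append(rawLine).append('\n');
      }
    }
    if (current != null) {
      parts.add(current.toString()); // unterminated last part (truncated mail)
    }
    return parts;
  }

  public static void main(String[] args) {
    String boundary = "------------020306090503030103060209";
    String body = "This is a multi-part message in MIME format.\n"
        + "--" + boundary + "\n"
        + "Content-Type: text/plain\n\nthere it goes 2\n"
        + "--" + boundary + "\n"
        + "Content-Type: application/octet-stream\n\nUEsDBBQ=\n"
        + "--" + boundary + "--\n";
    List<String> parts = split(body, boundary);
    System.out.println("parts found: " + parts.size());
  }
}
```

If a parser following this rule still loses the parts, the preamble text is unlikely to be the cause.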

I really hope that some Solr developer can have a look; we cannot be the only
ones having this problem, and I've spent almost twenty hours debugging it.

Regards

PS: Example of a mail that doesn't get processed:
Return-Path: marcos.gar...@savoirfairelinux.com
Received: from mail.savoirfairelinux.com (LHLO mail.savoirfairelinux.com)
 (192.168.52.6) by mail.savoirfairelinux.com with LMTP; Mon, 18 Mar 2013
 12:10:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
        by mail.savoirfairelinux.com (Postfix) with ESMTP id 39CA425819D
        for <solr.t...@savoirfairelinux.com>; Mon, 18 Mar 2013 12:10:22 -0400 (EDT)
X-Virus-Scanned: amavisd-new at mail.savoirfairelinux.com
X-Spam-Flag: NO
X-Spam-Score: -2.9
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 tagged_above=-10 required=4.4
        tests=[ALL_TRUSTED=-1, BAYES_00=-1.9] autolearn=ham
Received: from mail.savoirfairelinux.com ([127.0.0.1])
        by localhost (mail.savoirfairelinux.com [127.0.0.1]) (amavisd-new, port 10024)
        with ESMTP id fxeONJBdw8JA for <solr.t...@savoirfairelinux.com>;
        Mon, 18 Mar 2013 12:10:21 -0400 (EDT)
Received: from [192.168.50.126] (mtl.savoirfairelinux.net [208.88.110.46])
        by mail.savoirfairelinux.com (Postfix) with ESMTPSA id D0BC025819B
        for <solr.t...@savoirfairelinux.com>; Mon, 18 Mar 2013 12:10:21 -0400 (EDT)
Message-ID: <51473c6d.8010...@savoirfairelinux.com>
Date: Mon, 18 Mar 2013 12:10:21 -0400
From: Marcos Garcia <marcos.gar...@savoirfairelinux.com>
Organization: SFL
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130311 Thunderbird/17.0.4
MIME-Version: 1.0
To: solr.t...@savoirfairelinux.com
Subject: normal mail 2
X-Opacus-Archived: none
Content-Type: multipart/mixed;
 boundary="------------020306090503030103060209"

This is a multi-part message in MIME format.
--------------020306090503030103060209
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

there it goes 2


-- 
Marcos Garcia
Consultant en logiciel libre
Savoir-faire Linux
http://www.savoirfairelinux.com
marcos.gar...@savoirfairelinux.com
Tel : (514) 276-5468 ext 137


--------------020306090503030103060209
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;
 name="videos a supprimer.xlsx"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="videos a supprimer.xlsx"

UEsDBBQABgAIAAAAIQBxDjkrcAEAAKAFAAATANsBW0NvbnRlbnRfVHlwZXNdLnhtbCCi1wEo
oAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADMlE1OwzAQhfdI3CHyFiVu
i4QQatoFP0uoBBzA2JPGqmNbHre0t2eS0ApQiFTSBZtEUTTvvflm7Ol8W5lkAwG1szkbZyOW
gJVOabvM2evLQ3rNEozCKmGchZztANl8dn42fdl5wISqLeasjNHfcI6yhEpg5jxY+lO4UIlI
n2HJvZArsQQ+GY2uuHQ2go1prDXYbPpEAYJWkCxEiI+iIh++NTySGrTPcUZ6LLltC2vvnAnv
jZYiUnK+seqHa+qKQktQTq4r8soasYtahf9qiHFnAAdboQ8gFJYAsTJZK7p3voNCrE1M7rdE
oIUewOBxrX3CzKiyaR9L7bHHoZ9dP5N3F1Zvzq1OTaWmk1VC233uriWg6S2C88hp1oMDQI1c
gUo9SUKIGg7MurxpAevemzEib16TwRm+r8ZBv49BR47Lf5Jj+Kn8Gw8sRQD1HAPdUic/rl+1
++Zy2E3pAhw/kP0Zrqs7NpI39+vsAwAA//8DAFBLAwQUAAYACAAAACEAtVUwI/UAAABMAgAA
CwDOAV9yZWxzLy5yZWxzIKLKASigAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

PPS: Example of my trace output:
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor getNextMail
INFO: Inside getNextMail before next()
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument start
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.part instanceof message
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.isMimeType #1 multipart/MIXED; boundary=------------020306090503030103060209
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor getNextMail
INFO: Inside getNextMail before next()
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument start
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.part instanceof message
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.ELSE #1
Mar 18, 2013 1:50:02 PM org.apache.solr.handler.dataimport.MailEntityProcessor addPartToDocument
INFO: Inside addPartToDocument.ELSE #2



Leif Hetlesæther <leif <at> gurusoft.no> writes:
