Hi Tim,

Regarding the list of Metadata objects that is returned, is the code supposed to include the number of attachments in a particular email and/or the names of the attachments? For example, if there are 3 attachments in the email, we should be able to see immediately from the Metadata that there are attachments, and that there are 3 of them.
Thank you.

Regards,
Edwin

On Sat, 3 Aug 2019 at 07:19, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Thanks for the reply, will find out more about it.
>
> Currently I am able to retrieve the normal Metadata of the email, but not
> the Metadata of the attachments, which are part of the contents of the EML
> file and look something like this:
>
> --000000000000d8b77b057d59ca19--
>
> --000000000000d8b77e057d59ca1b
> Content-Type: application/pdf; name="file1.pdf"
> Content-Disposition: attachment; filename="file1.pdf"
> Content-Transfer-Encoding: base64
> Content-ID: <f_jpurtpnk0>
> X-Attachment-Id: f_jpurtpnk0
>
> Regards,
> Edwin
>
> On Sat, 3 Aug 2019 at 05:38, Tim Allison <talli...@apache.org> wrote:
>
>> I'd strongly recommend rolling your own ingest code. See Erick's
>> superb: https://lucidworks.com/post/indexing-with-solrj/
>>
>> You can easily get attachments via the RecursiveParserWrapper, e.g.
>>
>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L351
>>
>> This will return a list of Metadata objects; the first one will be the
>> main/container, each other entry will be an attachment. Let us know
>> if you have any questions/surprises. There are a couple of todos for
>> .eml...
>>
>> On Fri, Aug 2, 2019 at 3:43 AM Jan Høydahl <jan....@cominvent.com> wrote:
>> >
>> > Try the Apache Tika mailing list.
>> >
>> > --
>> > Jan Høydahl, search solution architect
>> > Cominvent AS - www.cominvent.com
>> >
>> > > On 2 Aug 2019 at 05:01, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > Does anyone know if this can be done on the Solr side?
>> > > Or does it have to be done on the Tika side?
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > > On Thu, 1 Aug 2019 at 09:38, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> I would like to check: is there any way we can detect the number of
>> > >> attachments and their names during indexing of EML files in Solr, and
>> > >> index that information into Solr?
>> > >>
>> > >> Currently, Solr is able to use Tika and Tesseract OCR to extract the
>> > >> contents of the attachments. However, I could not find the information
>> > >> about the number of attachments in the EML file or their filenames.
>> > >>
>> > >> I am using Solr 7.6.0 in production, and am also trying out the new
>> > >> Solr 8.2.0.
>> > >>
>> > >> Regards,
>> > >> Edwin
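
A note on the EML excerpt quoted above: the attachment name is carried in the part's Content-Disposition header, so the count and names are at least recoverable from the raw file. The sketch below is NOT Tika's parser and not the RecursiveParserWrapper route Tim recommends (with that route, going by his description, the attachment count would be the size of the returned list minus one for the container). It is only a naive JDK-only line scan, and the class and method names (EmlAttachmentScan, attachmentNames) are made up for illustration. It assumes unfolded headers and a plain filename="..." parameter; folded headers or RFC 2231 encoded names (filename*=) would show up as "(unnamed)", and attachments nested inside an attached message would be counted as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: scans raw EML text for
// "Content-Disposition: attachment" header lines. Not a real MIME parser.
final class EmlAttachmentScan {

    // Matches e.g.: Content-Disposition: attachment; filename="file1.pdf"
    // The filename parameter is optional; group 1 is null when it is absent.
    private static final Pattern ATTACHMENT = Pattern.compile(
            "^Content-Disposition:\\s*attachment\\b(?:.*?\\bfilename=\"?([^\";]+)\"?)?",
            Pattern.CASE_INSENSITIVE);

    // Returns one entry per attachment header found, in file order.
    static List<String> attachmentNames(String rawEml) {
        List<String> names = new ArrayList<>();
        for (String line : rawEml.split("\\r?\\n")) {
            Matcher m = ATTACHMENT.matcher(line);
            if (m.find()) {
                String name = m.group(1);
                names.add(name != null ? name.trim() : "(unnamed)");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Minimal made-up message; the attachment part mirrors the headers
        // quoted in the thread, with a shortened boundary and dummy body.
        String eml = String.join("\r\n",
                "Content-Type: multipart/mixed; boundary=\"b1\"",
                "",
                "--b1",
                "Content-Type: text/plain",
                "",
                "See attached.",
                "--b1",
                "Content-Type: application/pdf; name=\"file1.pdf\"",
                "Content-Disposition: attachment; filename=\"file1.pdf\"",
                "Content-Transfer-Encoding: base64",
                "",
                "AAAA",
                "--b1--",
                "");
        List<String> names = attachmentNames(eml);
        // prints: 1 attachment(s): [file1.pdf]
        System.out.println(names.size() + " attachment(s): " + names);
    }
}
```

The resulting count and names could then be added as extra fields on the document sent to Solr from custom ingest code; which field names to use is left open here.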