Here's an example of what Upayavira is talking about. https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
It has some RDBMS bits, but you can take those out.

Best,
Erick

On Wed, Dec 23, 2015 at 1:27 AM, Upayavira <u...@odoko.co.uk> wrote:
> If your needs of Tika fall outside of those provided by the embedded
> Tika, I would suggest you include Tika in your own ingestion pipeline,
> and just post raw content to Solr. This will probably perform better
> anyway, as you are otherwise using up valuable Solr resources to do your
> extraction work, and, as you are seeing, have far less control over what
> happens inside than you would if Tika was consumed by your own
> application.
>
> Upayavira
>
> On Wed, Dec 23, 2015, at 03:11 AM, Zheng Lin Edwin Yeo wrote:
>> Hi,
>>
>> I'm facing the same issue you faced 2 months back: I am able to
>> extract the image content if the images are in .jpg or .png format, but
>> not able to extract the images in a PDF, even after setting
>> "extractInlineImages true" in PDFParser.properties.
>>
>> Have you managed to find alternative solutions to this problem?
>>
>> Regards,
>> Edwin
>>
>> On 22 October 2015 at 18:05, Damien Picard <picard.dam...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I'm using Solr 5.3.0 on Red Hat EL 7 and I am trying to extract content
>> > from PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler.
>> >
>> > Everything works fine, except when I want to extract content from images
>> > embedded in PDF/Word etc.
>> > documents:
>> >
>> > I send an extract request like this:
>> > POST /update/extract?literal.id=ocrpdf8&fmap.content=attr_content&uprefix=attr_
>> >
>> > In attr_content, I get:
>> > \n \n date 2015-08-28T13:23:03Z \n
>> > pdf:PDFVersion 1.4 \n
>> > xmp:CreatorTool PDFCreator Version 1.2.3 \n
>> > stream_content_type application/pdf \n
>> > Keywords \n
>> > subject \n
>> > dc:creator S050735 \n
>> > dcterms:created 2015-08-28T13:23:03Z \n
>> > Last-Modified 2015-08-28T13:23:03Z \n
>> > dcterms:modified 2015-08-28T13:23:03Z \n
>> > dc:format application/pdf; version=1.4 \n
>> > Last-Save-Date 2015-08-28T13:23:03Z \n
>> > stream_name imagepdf.pdf \n
>> > meta:save-date 2015-08-28T13:23:03Z \n
>> > pdf:encrypted false \n
>> > dc:title imagepdf \n
>> > modified 2015-08-28T13:23:03Z \n
>> > cp:subject \n
>> > Content-Type application/pdf \n
>> > stream_size 423660 \n
>> > X-Parsed-By org.apache.tika.parser.DefaultParser \n
>> > X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n
>> > creator S050735 \n
>> > meta:author S050735 \n
>> > dc:subject \n
>> > meta:creation-date 2015-08-28T13:23:03Z \n
>> > stream_source_info the-file \n
>> > created Fri Aug 28 13:23:03 UTC 2015 \n
>> > xmpTPg:NPages 1 \n
>> > Creation-Date 2015-08-28T13:23:03Z \n
>> > meta:keyword \n
>> > Author S050735 \n
>> > producer GPL Ghostscript 9.04 \n
>> > imagepdf \n
>> > \n
>> > page \n
>> > Page 1 sur 1\n \n
>> > 28/08/2015
>> > http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4.
>> > ..
>> > \n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg
>> > embedded:image2.jpg image2.jpg \n
>> >
>> > So Tika works fine, but it doesn't apply OCR content extraction to the
>> > embedded images.
>> >
>> > When I post an image (JPG) to /update/extract, I get its content indexed
>> > through Tesseract OCR (attr_content field):
>> > \n \n stream_size 55422 \n
>> > X-Parsed-By org.apache.tika.parser.DefaultParser \n
>> > X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n
>> > stream_content_type image/jpeg \n
>> > stream_name OM_1.jpg \n
>> > stream_source_info the-file \n
>> > Content-Type image/jpeg \n \n \n
>> > ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was
>> > visiting a.\ncertain public school, a school set in a typically
>> > English\ncountryside, which on the June clay of my visit was wonder-\nfully
>> > beauliful. The Head Master—-no less typical than his\nschool and the
>> > country-side—pointed out the charms of\nboth, and his pride came out in the
>> > final remark which he made\nbeforehe left me. He explained that he had a
>> > class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can
>> > you\n\n, conceive anything more delightful than a class in
>> > Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n
>> > stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
>> > X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By
>> > org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n
>> > Resolution Units inch \n stream_source_info the-file \n Compression Type
>> > Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n
>> > tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1,
>> > Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization
>> > table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X
>> > Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n
>> > Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1
>> > vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type
>> > image/jpeg \n Y Resolution 72 dots
>> >
>> > I see on the Tika JIRA that I have to enable extractInlineImages in
>> > org/apache/tika/parser/pdf/PDFParser.properties to force image extraction
>> > from PDFs. So I did, and I packaged a tika-app-1.7.jar that contains a
>> > tika-parsers-1.7.jar with this file modified to set the property to true.
>> > Then I tested my Tika JAR from the CLI:
>> >
>> > # java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf
>> >
>> > In this case, I get the image content:
>> >
>> >
>> > Page 1 sur 1
>> >
>> > 28/08/2015
>> > http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4
>> > .
>> > ..
>> >
>> > Simple Evan!
>> > Use Case
>> > Sdsedulet
>> >
>> > So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar with my
>> > modified one, but the images in my PDF are still not extracted.
>> >
>> > Does anybody know what I'm doing wrong?
>> >
>> > Thank you.
>> >
>> > --
>> > Damien Picard
>> > Expert GWT
>> > <http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html>
>> > Mob : 06 11 51 47 78
>> >
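
For anyone skimming the thread, here is a rough sketch of the "run Tika in your own pipeline and post the text to Solr" approach Upayavira suggests. It is not taken from the linked SolrJ article and is only one possible shape for it. It shells out to the same tika-app command Damien ran from the CLI, then builds a JSON update request for Solr's /update handler. Assumptions that do not come from the thread: the Solr URL and core name used in the usage note (http://localhost:8983/solr/mycore) are hypothetical, the helper names are made up for illustration, and the attr_content field is simply reused from Damien's request, so your schema has to define it or a matching dynamic field. Whether inline PDF images are actually OCR'd still depends on how your Tika JAR is configured.

```python
import json
import subprocess
import urllib.request


def tika_command(path, tika_jar="tika-app-1.7.jar"):
    """Build the CLI call used in the thread: java -jar <jar> -t <file>."""
    return ["java", "-jar", tika_jar, "-t", path]


def run_tika(cmd):
    """Run tika-app and return the extracted plain text from stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def build_update_request(solr_url, doc_id, text):
    """Build a POST to <solr_url>/update carrying one JSON document.

    The field name attr_content is borrowed from the thread; it is an
    assumption that your schema accepts it.
    """
    body = json.dumps([{"id": doc_id, "attr_content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_file(path, doc_id, solr_url, extract=None, send=None):
    """Extract text outside Solr, then post the raw content to Solr.

    extract and send can be swapped out, e.g. for testing without Java
    or a running Solr instance.
    """
    extract = extract or (lambda p: run_tika(tika_command(p)))
    send = send or urllib.request.urlopen
    text = extract(path)
    return send(build_update_request(solr_url, doc_id, text))
```

Usage would look something like index_file("/data/docs/imagepdf.pdf", "ocrpdf8", "http://localhost:8983/solr/mycore"), with the URL adjusted to your own core. Committing on every document (commit=true) keeps the sketch short; for bulk loads you would batch documents and commit less often.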