More properly, it would be best to fix Tika and thus not push extra
complexity onto its many users. Error handling is one thing; crashes,
though, ought to be designed out.
Thanks,
Joe D.
On 25/08/2020 10:54, Charlie Hull wrote:
On 25/08/2020 06:04, Srinivas Kashyap wrote:
Hi Alexandre,
Yes, these are the same PDF files on both Windows and Linux. There
are around 30 PDF files; I tried indexing a single file but got the
same error. Is it related to how PDFs are stored on Linux?
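One way to confirm that the copies on the two machines really are byte-for-byte identical is to compare checksums. The following is only a minimal sketch: the helper names and the `*.pdf` pattern are illustrative assumptions, and the idea is to build a manifest on each machine and compare the two.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(directory, pattern="*.pdf"):
    """Map each matching file name in a directory to its digest."""
    return {p.name: sha256_of(p) for p in sorted(Path(directory).glob(pattern))}


def differing_files(manifest_a, manifest_b):
    """Return names present in both manifests whose digests differ."""
    shared = manifest_a.keys() & manifest_b.keys()
    return sorted(name for name in shared if manifest_a[name] != manifest_b[name])
```

Any name returned by `differing_files` was altered somewhere between the two systems, which would point at the copy step rather than at Solr or Tika.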
Did you try running Tika (the same version as you're using in Solr)
standalone on the file as Alexandre suggested?
And with regard to DIH and Tika going away, can you point to any
program that extracts content from PDFs and pushes it into Solr?
https://lucidworks.com/post/indexing-with-solrj/ is one example. You
should run Tika separately as it's entirely possible for it to fail to
parse a PDF and crash - and if you're running it in DIH & Solr it then
brings down everything. Separate your PDF processing from your Solr
indexing.
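A rough sketch of that separation, under stated assumptions: a standalone tika-server is listening on its default port 9998, and the Solr collection name `pdfs` and the field names `id` and `content` are hypothetical. Each file is extracted on its own, so a PDF that fails to parse is recorded and skipped instead of stopping the whole run.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints; adjust to your own setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/pdfs/update?commit=true"


def extract_text(path, tika_url=TIKA_URL, timeout=60):
    """Send one PDF to the Tika server and return its plain text."""
    request = urllib.request.Request(
        tika_url,
        data=Path(path).read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def build_documents(paths, extractor=extract_text):
    """Extract each file independently so one bad PDF cannot stop the batch."""
    documents, failures = [], {}
    for path in paths:
        try:
            documents.append({"id": str(path), "content": extractor(path)})
        except Exception as error:  # parse errors, HTTP errors, timeouts
            failures[str(path)] = repr(error)
    return documents, failures


def post_to_solr(documents, solr_url=SOLR_UPDATE_URL, timeout=60):
    """POST a JSON array of documents to Solr's update handler."""
    request = urllib.request.Request(
        solr_url,
        data=json.dumps(documents).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Because extraction happens outside Solr, a parser failure only adds an entry to `failures`; the documents that did parse can still be passed to `post_to_solr`.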
Cheers
Charlie
Thanks,
Srinivas Kashyap
-----Original Message-----
From: Alexandre Rafalovitch <arafa...@gmail.com>
Sent: 24 August 2020 20:54
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: PDF extraction using Tika
The issue seems to be with a specific file, at a level well below
Solr's and possibly even below Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
Are you indexing the same files on Windows and Linux? I am guessing
not. I would try to narrow down which of the files it is. One way
could be to get a standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably
complain with the same error.
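A minimal sketch of that narrowing-down step: run an extraction command once per file and collect the ones that fail. It assumes a failed parse shows up as a non-zero exit status, and the function name and the `*.pdf` pattern are illustrative.

```python
import subprocess
from pathlib import Path


def find_failing_files(directory, command_prefix, pattern="*.pdf", timeout=120):
    """Run command_prefix + [file] per file; return {file name: reason} for failures."""
    failures = {}
    for path in sorted(Path(directory).glob(pattern)):
        try:
            result = subprocess.run(
                [*command_prefix, str(path)],
                capture_output=True, text=True, errors="replace", timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures[path.name] = "timed out"
            continue
        if result.returncode != 0:
            lines = result.stderr.strip().splitlines()
            # Prefer the innermost "Caused by:" line of a Java stack trace.
            causes = [line for line in lines if line.startswith("Caused by:")]
            if causes:
                failures[path.name] = causes[-1]
            elif lines:
                failures[path.name] = lines[0]
            else:
                failures[path.name] = f"exit status {result.returncode}"
    return failures
```

With the tika-app jar matching Solr's embedded version, the call might look like `find_failing_files("/path/to/pdfs", ["java", "-jar", "/path/to/tika-app.jar", "--text"])`, where both paths are placeholders.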
Regards,
Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for
production. And both will be going away in future Solr versions. You
may have a much less brittle pipeline if you save the structured
outputs from those Tika standalone runs and then index them into
Solr, possibly pre-processed.
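As a sketch of what saving and later indexing those outputs could look like: one JSON file per source document, read back and lightly cleaned before being sent to Solr. The record layout, the field names `id` and `content`, and the whitespace clean-up are all illustrative assumptions, and `source_name` is expected to be a bare file name.

```python
import json
import re
from pathlib import Path


def save_extraction(output_dir, source_name, text, metadata=None):
    """Write one extraction result to <output_dir>/<source_name>.json."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    record = {"source": source_name, "text": text, "metadata": metadata or {}}
    target = output_dir / (source_name + ".json")
    target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return target


def load_solr_documents(output_dir):
    """Read saved results back and turn them into Solr-ready dictionaries."""
    documents = []
    for path in sorted(Path(output_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        documents.append({
            "id": record["source"],
            # Example pre-processing step: collapse runs of whitespace.
            "content": re.sub(r"\s+", " ", record["text"]).strip(),
        })
    return documents
```

Keeping the extracted text on disk means a re-index does not have to re-parse the PDFs, and a file that failed extraction simply has no JSON record.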
On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap
<srini...@bamboorose.com.invalid> wrote:
Hello,
We are using TikaEntityProcessor to extract the content of PDFs and
make it searchable.
When Jetty runs on a Windows machine, we can successfully load
documents using a DIH full import (Tika entity). Here the PDFs are
kept on the Windows file system.
But when Solr's Jetty runs on a Linux machine and we run the DIH, we
get the exception below (here the PDFs are kept on the Linux file
system):
Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to read content Processing Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to read content Processing Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
... 4 more
Caused by:
org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to read content Processing Document # 1
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
... 6 more
Caused by: org.apache.tika.exception.TikaException: Unable to
extract PDF content
at
org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
at
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
... 10 more
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
at
org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:226)
at
org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:163)
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at
org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at
org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at
org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
... 15 more
Can you please suggest how to extract PDF content on a Linux-based
file system?
Thanks,
Srinivas Kashyap