Hi Alex,
Thanks a lot for your help!
I tested this with the 'techproducts' example as you suggested, and
it worked fine.
You are right, the documentation seems to be outdated in this respect.
I have just reviewed the solrconfig.xml of the 'schemaless' example and
found that the Solr Cell configuration was missing entirely.
After adding it as described at
https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-extractingrequesthandler-in-solrconfig-xml
everything worked fine again.
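For anyone who wants to check their own configuration for the same gap,
here is a rough shell sketch. It assumes the handler is registered under
the name /update/extract and that the library directives mention
"extraction", as in the guide linked above. The function name
check_solr_cell_config is made up for illustration, and it is only a
text search, not an XML parse, so a commented-out block would fool it.

```shell
#!/bin/sh
# Hypothetical helper: report whether a solrconfig.xml mentions the two
# pieces Solr Cell needs (the extract handler and the <lib> directives).
# Plain text search only; commented-out XML is NOT detected as such.
check_solr_cell_config() {
  config_file="$1"
  status=0

  # Assumption: the handler is registered as name="/update/extract".
  if grep -q 'name="/update/extract"' "$config_file"; then
    echo "handler: /update/extract found"
  else
    echo "handler: /update/extract missing"
    status=1
  fi

  # Assumption: the <lib> directive points at a path containing "extraction".
  if grep -q '<lib .*extraction' "$config_file"; then
    echo "libs: extraction <lib> directive found"
  else
    echo "libs: extraction <lib> directive missing"
    status=1
  fi

  return "$status"
}
```

Call it with the path to the solrconfig.xml of the core you are using,
e.g. check_solr_cell_config path/to/conf/solrconfig.xml (placeholder
path). It returns non-zero if either piece is missing.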
What can I do to help update the docs?
Best regards,
Leon
On 05.02.21 at 16:15, Alexandre Rafalovitch wrote:
I think the extract handler is not defined in schemaless. This may be
a change from before and the documentation is out of sync.
Can you try the 'techproducts' example instead of schemaless:
bin/solr stop (if you are still running it)
bin/solr start -e techproducts
Then run the import command again.
The Tika integration is defined in solrconfig.xml and needs both the
handler defined and some libraries loaded. Once you have confirmed that
it does what you want, you can copy those into whatever configuration
you are working with.
Regards,
Alex.
On Fri, 5 Feb 2021 at 07:38, nq <nq@uber.space> wrote:
Hi,
I am new to Solr and tried to follow the guide to upload PDF data using
Tika, on Solr 8.7.0 (running on Debian 10):
https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html
but I get an HTTP 404 error when trying to import the file.
In the Solr installation directory, after starting the example server
using
solr/bin/solr -e schemaless
I first used the Post Tool to index a PDF file as described in the
guide, which gave the following output (paths truncated using “[…]” for
privacy reasons):
bin/post -c gettingstarted example/exampledocs/solr-word.pdf -params "literal.id=doc1"
java -classpath /[…]/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes
-Dparams=literal.id=doc1 -Dc=gettingstarted -Ddata=files
org.apache.solr.util.SimplePostTool example/exampledocs/solr-word.pdf
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update?literal.id=doc1...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr-word.pdf (application/pdf) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&resource.name=%2F[…]%2Fsolr-8.7.0%2Fexample%2Fexampledocs%2Fsolr-word.pdf
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/gettingstarted/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException:
http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&resource.name=%2F[…]%2Fsolr-8.7.0%2Fexample%2Fexampledocs%2Fsolr-word.pdf
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/gettingstarted/update?literal.id=doc1...
Time spent: 0:00:00.038
As a result, no changes were visible in Solr.
Using curl results in the same HTTP response:
curl 'http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/gettingstarted/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
Sorry if this has already been discussed somewhere; I have not been able
to find anything helpful yet.
Thank you!
Leon