[jira] [Commented] (SOLR-15039) Error in Solr Cell extract when using multipart upload with some documents

sam marshall (Jira) Thu, 10 Dec 2020 02:45:04 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247159#comment-17247159
 ]


sam marshall commented on SOLR-15039:
-------------------------------------

In case it helps to reproduce the problem, here is a complete sequence of 
commands that reproduces it, starting from a fresh Ubuntu 18.04 
installation (I used a Microsoft Azure VM). It uses a fresh Solr 8.7.0 
installation with the supplied 'techproducts' sample configuration, which has 
the extract handler enabled and which I assume is correctly configured.

After creating the new VM, I copied the file b364b24b-public into the home 
directory. This is the full sequence of commands I needed to reproduce the 
problem (it doesn't quite run as a script, because you have to press Y or Q 
at a couple of points):

{code}
sudo apt install openjdk-11-jdk
wget https://archive.apache.org/dist/lucene/solr/8.7.0/solr-8.7.0.tgz
tar xzf solr-8.7.0.tgz solr-8.7.0/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-8.7.0.tgz
sudo su - solr -c "/opt/solr/bin/solr create -c testcollection -d sample_techproducts_configs"
curl "http://localhost:8983/solr/testcollection/update/extract?&extractOnly=true" \
  --data-binary '@b364b24b-public' -H 'Content-type:text/html' > nonmultipart-result.txt
curl "http://localhost:8983/solr/testcollection/update/extract?&extractOnly=true" \
  -F 'myfile=@b364b24b-public' -H 'Content-type:text/html' > multipart-result.txt
{code}

After that point you can see the results in the two files, which are of clearly 
different sizes:

{code}
sam@solr-test-temp:~$ ls -l
total 212648
-rw-r--r-- 1 sam sam  10323956 Dec 10 10:32 b364b24b-public
-rwxr-xr-x 1 sam sam     12694 Oct 28 09:21 install_solr_service.sh
-rw-rw-r-- 1 sam sam   6589425 Dec 10 10:40 multipart-result.txt
-rw-rw-r-- 1 sam sam      9988 Dec 10 10:39 nonmultipart-result.txt
-rw-rw-r-- 1 sam sam 200805960 Oct 29 19:05 solr-8.7.0.tgz
{code}
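
The size difference can also be checked without reading the listing by eye. The following is only a sketch: `size_ratio` is a hypothetical helper name (not part of Solr), and it assumes the two result files produced by the commands above are in the current directory and that the first one is not empty.

```shell
# Hypothetical helper: prints how many times larger the second file is
# than the first, using integer division on the byte counts.
size_ratio() {
  small=$(wc -c < "$1")   # byte count of the first file
  large=$(wc -c < "$2")   # byte count of the second file
  echo $(( large / small ))
}

# Usage, assuming both result files exist:
# size_ratio nonmultipart-result.txt multipart-result.txt
```

With the sizes shown in the listing above (9988 and 6589425 bytes), the multipart result is several hundred times larger than the non-multipart one.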

> Error in Solr Cell extract when using multipart upload with some documents
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15039
>                 URL: https://issues.apache.org/jira/browse/SOLR-15039
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 6.6.4, 8.4, 8.6.3, 8.7
>            Reporter: sam marshall
>            Priority: Major
>         Attachments: b364b24b-public
>
>
> (Note: I asked about this in the IRC channel as prompted, but didn't get a 
> response.)
> When uploading particular documents to /update/extract, you get different 
> (wrong) results if you are using multipart file upload compared to the basic 
> encoded upload, even though both methods are shown on the documentation page 
> ([https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html]).
> The first example in the documentation page uses a multipart POST with a 
> field called 'myfile' set to the file content. Some later examples use a 
> standard POST with the raw data provided.
> Here are these two approaches in the commands I used with my example file (I 
> have replaced the URL, username, password, and collection name for my Solr, 
> which isn't publicly available):
> {code}
> curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" \
>   --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > nonmultipart-result.txt
> curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" \
>   -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > multipart-result.txt
> {code}
> The example file is a ~10MB PowerPoint with a few sentences of English text 
> in it (and some pictures).
> The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it 
> includes an XHTML version of the text content of the PowerPoint, and some 
> metadata.
> The multipart-result.txt is 7,352,348 bytes long and contains mainly a large 
> sequence of Chinese characters, or at least, random data being interpreted as 
> Chinese characters.
> This example was running against Solr 8.4 on a Linux server from our cloud 
> Solr supplier. On another Linux (Ubuntu 18) server that I set up myself, I 
> got the same results using various other Solr versions. Running against 
> localhost, which is a Windows 10 machine with Solr 8.5, I get slightly 
> different results: the non-multipart request works correctly, but 
> multipart-result.txt in that case contains a slightly more helpful error 500 
> message:
> {code}
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader">
>   <int name="status">500</int>
>   <int name="QTime">138</int>
> </lst>
> <lst name="error">
>   <lst name="metadata">
>     <str name="error-class">org.apache.solr.common.SolrException</str>
>     <str name="root-error-class">java.util.zip.ZipException</str>
>   </lst>
>   <str name="msg">org.apache.tika.exception.TikaException: Error creating OOXML extractor</str>
>   <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
> ...
> Caused by: java.util.zip.ZipException: Unexpected record signature: 0X2D2D2D2D
>         at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
> {code}
> My conclusion is that even though both versions of this command (-F 
> myfile=@file, and --data-binary @file) are shown in the documentation, they 
> clearly don't behave the same way for this file.
> Note: Although I've reproduced this using command-line curl to simplify this 
> report, this is actually the result of a highly tortuous debugging process 
> where I eventually managed to track down why a search index (generated by an 
> open source learning system, Moodle, which currently uses the multipart post 
> approach although I might have to change that) was using up too much disk 
> space...
> I'm going to try to remove private data from the offending file and attach it 
> here.
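
One detail in the quoted trace may be worth decoding: the `Unexpected record signature: 0X2D2D2D2D` value. The sketch below turns such a hex signature into ASCII. `decode_sig` is a hypothetical helper name, not a Solr or Tika tool, and it assumes a shell whose `printf` accepts `0x`-prefixed numbers (bash and dash do).

```shell
# Hypothetical helper: decode a hex signature such as 0X2D2D2D2D into
# the ASCII characters its bytes represent.
decode_sig() {
  hex=${1#0[xX]}                      # strip a leading 0x / 0X
  while [ "${#hex}" -ge 2 ]; do
    byte=${hex%"${hex#??}"}           # take the first two hex digits
    hex=${hex#??}                     # drop them from the remainder
    # convert the byte to a 3-digit octal escape and print that character
    printf "\\$(printf '%03o' "0x$byte")"
  done
  echo
}

decode_sig 0X2D2D2D2D   # prints: ----
```

The four bytes are hyphens. A multipart/form-data body starts with a boundary delimiter line that begins with `--`, and the boundaries curl generates themselves start with hyphens, so this value is at least consistent with the ZIP parser having been handed the raw multipart body rather than the file content alone. That is an inference from the signature, not something the trace confirms.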



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
