sam marshall created SOLR-15039:
-----------------------------------

             Summary: Error in Solr Cell extract when using multipart upload with some documents
                 Key: SOLR-15039
                 URL: https://issues.apache.org/jira/browse/SOLR-15039
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 8.7, 8.6.3, 8.4, 6.6.4
            Reporter: sam marshall


(Note: I asked about this in the IRC channel as prompted, but didn't get a 
response.)

When uploading particular documents to /update/extract, you get different 
(wrong) results with a multipart file upload than with a plain POST of the raw 
file data, even though both methods are shown on the documentation page 
([https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html]).

The first example in the documentation page uses a multipart POST with a field 
called 'myfile' set to the file content. Some later examples use a standard 
POST with the raw data provided.

Here are the two approaches, as the commands I used with my example file (I 
have replaced the URL, username, password, and collection name for my Solr 
instance, which isn't publicly available):

{code}
curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > nonmultipart-result.txt

curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > multipart-result.txt
{code}
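
To make the difference between the two requests concrete, here is a minimal Python sketch (not Solr or curl code) of what each request body looks like on the wire. The boundary string, the filename and the stand-in payload are assumptions for illustration only; the field name 'myfile' is the one from the command above.

```python
# Sketch only: builds a multipart/form-data body by hand to show how it
# differs from a raw POST body. Nothing here talks to Solr.

def build_multipart(field, filename, payload, boundary):
    """Return (Content-Type header value, body bytes) for a single file field."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return f"multipart/form-data; boundary={boundary}", head + payload + tail


# Stand-in for the real file: a .pptx is a ZIP container, so it starts with "PK\x03\x04".
payload = b"PK\x03\x04" + b"...rest of the file..."

# Raw upload (--data-binary): the body is exactly the file bytes.
raw_body = payload

# Multipart upload (-F): the file bytes are wrapped in boundary lines and part headers.
# Hypothetical boundary; real clients generate their own.
boundary = "------------------------example0123456789"
content_type, multipart_body = build_multipart("myfile", "example.pptx", payload, boundary)

print(raw_body[:4])        # b'PK\x03\x04'
print(multipart_body[:4])  # b'----'
print(content_type)
```

The server can only unwrap the second body if the Content-Type header it receives says multipart/form-data and names the boundary. Running curl with -v prints the request headers it actually sends, which may help to check what Content-Type reaches Solr when -F and -H 'Content-type:text/html' are combined.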

The example file is a ~10MB PowerPoint with a few sentences of English text in 
it (and some pictures).

The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it 
includes an XHTML version of the text content of the PowerPoint, and some 
metadata.

The multipart-result.txt is 7,352,348 bytes long and contains mainly a large 
sequence of Chinese characters, or at least, random data being interpreted as 
Chinese characters.

This example was run against Solr 8.4 on a Linux server from our cloud Solr 
supplier. On another Linux (Ubuntu 18) server that I set up myself, I got the 
same results using various other Solr versions. Running against localhost, 
which is a Windows 10 machine with Solr 8.5, I get slightly different results: 
the non-multipart upload works correctly, but multipart-result.txt in that case 
contains a slightly more helpful error 500 message:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">500</int>
  <int name="QTime">138</int>
</lst>
<lst name="error">
  <lst name="metadata">
    <str name="error-class">org.apache.solr.common.SolrException</str>
    <str name="root-error-class">java.util.zip.ZipException</str>
  </lst>
  <str name="msg">org.apache.tika.exception.TikaException: Error creating OOXML extractor</str>
  <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
...
Caused by: java.util.zip.ZipException: Unexpected record signature: 0X2D2D2D2D
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
{code}
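
The record signature in the ZipException can be decoded by hand. The short Python sketch below does that and compares it with the signature a ZIP local file header starts with (a .pptx file is a ZIP container). The variable names are only for illustration.

```python
import struct

# A ZIP local file header starts with these four bytes.
ZIP_LOCAL_HEADER = b"PK\x03\x04"

# Value reported in "Unexpected record signature: 0X2D2D2D2D".
reported = 0x2D2D2D2D

# All four bytes are identical, so byte order does not matter here.
as_bytes = struct.pack("<I", reported)

print(as_bytes)                      # b'----'
print(as_bytes == ZIP_LOCAL_HEADER)  # False
```

So the ZIP reader found four ASCII hyphens where it expected "PK". That would be consistent with the stream handed to Tika beginning with a multipart boundary line rather than with the file itself, although I have not confirmed that this is what happens inside Solr.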

My conclusion is that even though both versions of this command (-F 
myfile=@file, and --data-binary @file) are shown in the documentation, they 
clearly don't work equally.

Note: Although I've reproduced this using command-line curl to simplify this 
report, it is actually the result of a highly tortuous debugging process in 
which I eventually tracked down why a search index was using up too much disk 
space. The index is generated by an open source learning system, Moodle, which 
currently uses the multipart POST approach (although I might have to change 
that).

I'm going to try to remove private data from the offending file and attach it 
here.



