[ 
https://issues.apache.org/jira/browse/SOLR-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246701#comment-17246701
 ] 

sam marshall commented on SOLR-15039:
-------------------------------------

I've attached a version of a file that causes these problems, hopefully with 
private data (author name) removed. If you rename it to .ppt it will open in 
PowerPoint, so to that extent it's valid (the presentation is clearly 
unfinished). 

Please do not use the file for anything other than testing this Solr bug; my 
employer retains the copyright.

> Error in Solr Cell extract when using multipart upload with some documents
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15039
>                 URL: https://issues.apache.org/jira/browse/SOLR-15039
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 6.6.4, 8.4, 8.6.3, 8.7
>            Reporter: sam marshall
>            Priority: Major
>         Attachments: b364b24b-public
>
>
> (Note: I asked about this in the IRC channel as prompted, but didn't get a 
> response.)
> When uploading particular documents to /update/extract, multipart file 
> upload produces different (wrong) results from the basic raw-body upload, 
> even though both methods are shown on the documentation page 
> ([https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html]).
> The first example in the documentation page uses a multipart POST with a 
> field called 'myfile' set to the file content. Some later examples use a 
> standard POST with the raw data provided.
> Here are these two approaches in the commands I used with my example file (I 
> have replaced the URL, username, password, and collection name for my Solr, 
> which isn't publicly available):
> {code}
> curl --user myuser:mypassword \
>   "https://example.org/solr/mycollection/update/extract?&extractOnly=true" \
>   --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' \
>   -H 'Content-type:text/html' > nonmultipart-result.txt
> curl --user myuser:mypassword \
>   "https://example.org/solr/mycollection/update/extract?&extractOnly=true" \
>   -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' \
>   -H 'Content-type:text/html' > multipart-result.txt
> {code}
> The example file is a ~10MB PowerPoint with a few sentences of English text 
> in it (and some pictures).
> The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it 
> includes an XHTML version of the PowerPoint's text content, plus some 
> metadata.
> The multipart-result.txt file is 7,352,348 bytes long and consists mainly 
> of a long sequence of Chinese characters, or rather, of random data being 
> interpreted as Chinese characters.
> This example was run against Solr 8.4 on a Linux server from our cloud 
> Solr supplier. On another Linux (Ubuntu 18) server that I set up myself, 
> I got the same results with various other Solr versions. Running against 
> localhost, a Windows 10 machine with Solr 8.5, I get slightly different 
> results: the non-multipart request works correctly, but in that case 
> multipart-result.txt contains a slightly more helpful error 500 message:
> {code}
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader">
>   <int name="status">500</int>
>   <int name="QTime">138</int>
> </lst>
> <lst name="error">
>   <lst name="metadata">
>     <str name="error-class">org.apache.solr.common.SolrException</str>
>     <str name="root-error-class">java.util.zip.ZipException</str>
>   </lst>
>   <str name="msg">org.apache.tika.exception.TikaException: Error creating 
> OOXML extractor</str>
>   <str name="trace">org.apache.solr.common.SolrException: 
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
> ...
> Caused by: java.util.zip.ZipException: Unexpected record signature: 0X2D2D2D2D
>         at 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
> {code}
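> A plausible reading of that ZipException: 0x2D is the ASCII code for the 
> hyphen, so the four "record signature" bytes 0X2D2D2D2D are the string 
> "----", which is exactly how a multipart MIME boundary line begins. That 
> would suggest Tika was handed the raw multipart body, boundary lines and 
> all, instead of the decoded file. A quick shell check (plain ASCII 
> arithmetic, nothing Solr-specific):
> {code}
> # 0x2D = octal 055 = '-'; four of them form the start of a MIME boundary
> printf '\055\055\055\055\n'
> # prints: ----
> {code}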
> My conclusion is that even though both forms of this command (-F 
> myfile=@file and --data-binary @file) are shown in the documentation, they 
> clearly do not behave the same.
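> If part of the cause is on the client side, one candidate is the explicit 
> Content-type header: with -F, curl generates its own "Content-Type: 
> multipart/form-data; boundary=..." header, and an explicit 
> -H 'Content-type:text/html' overrides it, so the server may parse the 
> multipart body (boundaries included) as if it were the document itself. 
> An untested sketch of the multipart command without that header, setting 
> the per-part type with curl's ";type=" syntax instead (placeholder URL and 
> credentials as above):
> {code}
> curl --user myuser:mypassword \
>   "https://example.org/solr/mycollection/update/extract?extractOnly=true" \
>   -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829;type=text/html' \
>   > multipart-result.txt
> {code}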
> Note: Although I've reproduced this with command-line curl to simplify the 
> report, it is actually the outcome of a highly tortuous debugging process 
> in which I eventually tracked down why a search index (generated by Moodle, 
> an open-source learning system that currently uses the multipart POST 
> approach, though I may have to change that) was using up too much disk 
> space.
> I'm going to try to remove private data from the offending file and attach it 
> here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
