[ https://issues.apache.org/jira/browse/SOLR-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sam marshall updated SOLR-15039:
--------------------------------
    Attachment: b364b24b-public

> Error in Solr Cell extract when using multipart upload with some documents
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15039
>                 URL: https://issues.apache.org/jira/browse/SOLR-15039
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 6.6.4, 8.4, 8.6.3, 8.7
>            Reporter: sam marshall
>            Priority: Major
>         Attachments: b364b24b-public
>
>
> (Note: I asked about this in the IRC channel as prompted, but didn't get a
> response.)
>
> When uploading particular documents to /update/extract, you get different
> (wrong) results if you are using multipart file upload compared to the basic
> encoded upload, even though both methods are shown on the documentation page
> ([https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html]).
>
> The first example in the documentation page uses a multipart POST with a
> field called 'myfile' set to the file content. Some later examples use a
> standard POST with the raw data provided.
>
> Here are these two approaches in the commands I used with my example file (I
> have replaced the URL, username, password, and collection name for my Solr,
> which isn't publicly available):
> {code}
> curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > nonmultipart-result.txt
> curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > multipart-result.txt
> {code}
> The example file is a ~10MB PowerPoint with a few sentences of English text
> in it (and some pictures).
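
For readers comparing the two commands, the sketch below shows roughly what each request body looks like on the wire. It is illustrative only: the helper names `raw_body` and `multipart_body`, the boundary string, the file name `example.pptx`, and the sample bytes are made up for this note and are not taken from curl, Solr, or the attached file.

```python
# Hypothetical helpers for illustration; not part of curl or Solr.

def raw_body(file_bytes: bytes) -> bytes:
    """Body of a `--data-binary @file` request: the file content, unchanged."""
    return file_bytes


def multipart_body(field: str, filename: str, file_bytes: bytes, boundary: str) -> bytes:
    """Body of a `-F field=@file` request: the file wrapped in one multipart/form-data part."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + file_bytes + tail


# Stand-in for the start of a .pptx file (OOXML files are ZIP archives).
sample = b"PK\x03\x04" + b"\x00" * 16
# Made-up boundary in the style curl generates: a run of dashes plus a random suffix.
boundary = "------------------------0123456789abcdef"

raw = raw_body(sample)
multi = multipart_body("myfile", "example.pptx", sample, boundary)

print(raw[:4])    # b'PK\x03\x04'
print(multi[:4])  # b'----'
```

The file bytes are identical in both cases; only the multipart request adds an envelope around them, which the server has to parse as multipart in order to recover the file.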
>
> The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it
> includes an XHTML version of the text content of the PowerPoint, and some
> metadata.
>
> The multipart-result.txt is 7,352,348 bytes long and contains mainly a large
> sequence of Chinese characters, or at least, random data being interpreted as
> Chinese characters.
>
> This example was running against Solr 8.4 on a Linux server from our cloud
> Solr supplier. On another Linux (Ubuntu 18) server that I set up myself I got
> the same results using various other Solr versions. Running against localhost
> which is a Windows 10 machine with Solr 8.5, I get slightly different
> results; the non-multipart works correctly but the multipart-result.txt in
> that case is a slightly more helpful error 500 message:
> {code}
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
>   <lst name="responseHeader">
>     <int name="status">500</int>
>     <int name="QTime">138</int>
>   </lst>
>   <lst name="error">
>     <lst name="metadata">
>       <str name="error-class">org.apache.solr.common.SolrException</str>
>       <str name="root-error-class">java.util.zip.ZipException</str>
>     </lst>
>     <str name="msg">org.apache.tika.exception.TikaException: Error creating OOXML extractor</str>
>     <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
> ...
> Caused by: java.util.zip.ZipException: Unexpected record signature: 0X2D2D2D2D
>     at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
> {code}
> My conclusion is that even though both versions of this command (-F
> myfile=@file, and --data-binary @file) are shown in the documentation, they
> clearly don't work equally.
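
As a reading aid for the `ZipException` in the trace, the sketch below decodes the reported record signature and compares it with the signature a ZIP archive (and therefore an OOXML file such as a .pptx) normally starts with. The helper name `signature_bytes` is made up for this note. Reading the dashes as the start of a multipart boundary line is an assumption; the report itself does not confirm it.

```python
def signature_bytes(hex_literal: str) -> bytes:
    """Turn a hex literal such as '0X2D2D2D2D' into its raw bytes."""
    digits = hex_literal[2:] if hex_literal[:2].lower() == "0x" else hex_literal
    return bytes.fromhex(digits)


# Signature at the start of a ZIP local file header.
ZIP_LOCAL_FILE_HEADER = b"PK\x03\x04"

# Value copied from the error message in the report.
reported = signature_bytes("0X2D2D2D2D")

print(reported)                           # b'----'
print(reported == ZIP_LOCAL_FILE_HEADER)  # False
```

All four bytes are the same (0x2D, an ASCII hyphen), so the result does not depend on byte order. Four hyphens are what a parser would see first if it were handed a multipart body, boundary line included, in place of the file itself.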
>
> Note: Although I've reproduced this using command-line curl to simplify this
> report, this is actually the result of a highly tortuous debugging process
> where I eventually managed to track down why a search index (generated by an
> open source learning system, Moodle, which currently uses the multipart post
> approach although I might have to change that) was using up too much disk
> space...
>
> I'm going to try to remove private data from the offending file and attach it
> here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org