The migration of Tika to the latest 0.8 version seems to have reintroduced the issue.
I was able to get this working again with the following patches. (Solr Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:

> Hi Gary,
>
> I have been experiencing the same problem... Unable to extract content from
> archive file formats. I just tried again with a clean install of Solr 3.1.0
> (using Tika 0.8) and continue to experience the same results. Did you have
> any success with this problem with Solr 1.4.1 or 3.1.0 ?
>
> I'm using this curl command to send data to Solr.
> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "application/octet-stream" -F "myfile=@data.zip"
>
> No problem extracting single rich text documents, but archive files only
> result in the file names within the archive being indexed. Am I missing
> something else in my configuration? Solr doesn't seem to be unpacking the
> archive files. Based on the email chain associated with your first message,
> some people have been able to get this functionality to work as desired.
>
> On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <g...@inovem.com> wrote:
>
>> Can anyone shed any light on this, and whether it could be a config issue?
>> I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
>>
>> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
>> the ExtractingRequestHandler, I get the following log entry (formatted for
>> ease of reading) :
>>
>> SolrInputDocument[
>>   {
>>     ignored_meta=ignored_meta(1.0)={
>>       [stream_source_info, file, stream_content_type,
>>       application/octet-stream, stream_size, 260, stream_name, solr1.zip,
>>       Content-Type, application/zip]
>>     },
>>     ignored_=ignored_(1.0)={
>>       [package-entry, package-entry]
>>     },
>>     ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>>     ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
>>     ignored_stream_size=ignored_stream_size(1.0)={260},
>>     ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
>>     ignored_content_type=ignored_content_type(1.0)={application/zip},
>>     docid=docid(1.0)={74},
>>     type=type(1.0)={5},
>>     text=text(1.0)={ doc2.txt doc1.txt }
>>   }
>> ]
>>
>> So, the data coming back from Tika when parsing a ZIP file does not include
>> the file contents, only the names of the files contained therein. I've
>> tried forcing stream.type=application/zip in the CURL string, but that makes
>> no difference. If I specify an invalid stream.type then I get an exception
>> response, so I know it's being used.
>>
>> When I send one of those txt files individually to the
>> ExtractingRequestHandler, I get:
>>
>> SolrInputDocument[
>>   {
>>     ignored_meta=ignored_meta(1.0)={
>>       [stream_source_info, file, stream_content_type, text/plain,
>>       stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
>>     },
>>     ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>>     ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
>>     ignored_stream_size=ignored_stream_size(1.0)={30},
>>     ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
>>     ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
>>     docid=docid(1.0)={74},
>>     type=type(1.0)={5},
>>     text=text(1.0)={ The quick brown fox }
>>   }
>> ]
>>
>> and we see the file contents in the "text" field.
>>
>> I'm using the following requestHandler definition in solrconfig.xml:
>>
>> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
>> <requestHandler name="/update/extract"
>>     class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
>>     startup="lazy">
>>   <lst name="defaults">
>>     <!-- All the main content goes into "text"... if you need to return
>>          the extracted text or do highlighting, use a stored field. -->
>>     <str name="fmap.content">text</str>
>>     <str name="lowernames">true</str>
>>     <str name="uprefix">ignored_</str>
>>
>>     <!-- capture link hrefs but ignore div attributes -->
>>     <str name="captureAttr">true</str>
>>     <str name="fmap.a">links</str>
>>     <str name="fmap.div">ignored_</str>
>>   </lst>
>> </requestHandler>
>>
>> Is there any further debug or diagnostic I can get out of Tika to help me
>> work out why it's only returning the file names and not the file contents
>> when parsing a ZIP file?
>>
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>>
>> On 25/01/2011 16:48, Jayendra Patil wrote:
>>
>>> Hi Gary,
>>>
>>> The latest Solr Trunk was able to extract and index the contents of the zip
>>> file using the ExtractingRequestHandler.
>>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
>>> worked pretty well.
>>>
>>> Tested again with sample url and works fine -
>>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
>>>
>>> You would probably need to drill down to the Tika Jars and
>>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
>>>
>>> Regards,
>>> Jayendra
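
For anyone who cannot apply the patches, one possible client-side workaround (not something proposed in this thread) is to unpack the archive yourself and post each entry separately, since single documents are reported above to extract fine. Below is a minimal Python sketch of that idea. The function names, the `archive!entry` id scheme and the base URL are assumptions made for illustration; `literal.id`, `resource.name` and `commit` are standard Solr Cell request parameters, but whether a raw `application/octet-stream` body is detected correctly depends on your Tika setup.

```python
import io
import urllib.parse
import urllib.request
import zipfile


def iter_zip_entries(zip_bytes):
    """Yield (entry_name, entry_bytes) for each regular file in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            yield info.filename, archive.read(info)


def extract_url(base_url, doc_id, resource_name):
    """Build an /update/extract URL for one entry.

    resource.name is passed as a file-name hint for type detection.
    """
    query = urllib.parse.urlencode(
        {"literal.id": doc_id, "resource.name": resource_name, "commit": "true"}
    )
    return base_url + "?" + query


def post_entry(base_url, archive_name, entry_name, data):
    """POST one entry's raw bytes to Solr Cell (requires a running Solr)."""
    doc_id = archive_name + "!" + entry_name  # hypothetical id scheme
    request = urllib.request.Request(
        extract_url(base_url, doc_id, entry_name),
        data=data,  # supplying data makes this a POST request
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

To use it, read the archive into memory, loop over `iter_zip_entries(...)` and call `post_entry("http://localhost:8080/solr/update/extract", "solr1.zip", name, data)` for each entry. Every file then becomes its own Solr document, which differs from what the patched ExtractingRequestHandler would produce for the archive as a whole.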