Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Joey Hanzel Sun, 10 Apr 2011 19:35:59 -0700

Hi Gary,

I have been experiencing the same problem: I am unable to extract content
from archive file formats.  I just tried again with a clean install of Solr
3.1.0 (using Tika 0.8) and continue to see the same results.  Did you have
any success with this problem on Solr 1.4.1 or 3.1.0?


I'm using this curl command to send data to Solr:
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
  -H "application/octet-stream" -F "myfile=@data.zip"

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.
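
Until the server-side unpacking is sorted out, one possible workaround is to
unpack the archive on the client and send each entry to the extract handler
as its own document. The following is only a minimal sketch under stated
assumptions, not something confirmed in this thread: it reuses the endpoint
and the fmap.content mapping from the curl command above, the
"<archive>!<entry>" id scheme and all function names are hypothetical, and
resource.name is passed as a file-name hint on the assumption that the Solr
Cell version in use accepts it.

```python
import io
import urllib.parse
import urllib.request
import zipfile

# Assumption: same endpoint as the curl command above.
EXTRACT_URL = "http://localhost:8080/solr/update/extract"


def iter_zip_entries(zip_bytes):
    """Yield (entry_name, entry_bytes) for each regular file in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            yield info.filename, archive.read(info)


def build_extract_url(base_url, doc_id, entry_name, commit=False):
    """Build an extract URL with a properly encoded query string."""
    params = {
        "literal.id": doc_id,
        "fmap.content": "attr_content",
        # Assumption: resource.name is accepted as a file-name hint.
        "resource.name": entry_name,
    }
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urllib.parse.urlencode(params)


def post_zip_entries(zip_path, base_url=EXTRACT_URL):
    """POST every archive entry to Solr as a separate document."""
    with open(zip_path, "rb") as handle:
        entries = list(iter_zip_entries(handle.read()))
    for index, (name, data) in enumerate(entries):
        # Hypothetical id scheme: "<archive path>!<entry name>".
        doc_id = f"{zip_path}!{name}"
        is_last = index == len(entries) - 1  # commit once, on the last entry
        url = build_extract_url(base_url, doc_id, name, commit=is_last)
        request = urllib.request.Request(
            url,
            data=data,
            headers={"Content-Type": "application/octet-stream"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()
```

Calling post_zip_entries("data.zip") would then index one document per file
in the archive. This sidesteps the archive handling rather than fixing it, so
the contents are no longer searchable under a single id for the ZIP itself.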

On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <g...@inovem.com> wrote:

> Can anyone shed any light on this, and whether it could be a config issue?
>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
>
> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
> the ExtractingRequestHandler, I get the following log entry (formatted for
> ease of reading) :
>
> SolrInputDocument[
>    {
>    ignored_meta=ignored_meta(1.0)={
>        [stream_source_info, file, stream_content_type,
>         application/octet-stream, stream_size, 260, stream_name, solr1.zip,
>         Content-Type, application/zip]
>        },
>    ignored_=ignored_(1.0)={
>        [package-entry, package-entry]
>        },
>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
>    ignored_stream_size=ignored_stream_size(1.0)={260},
>    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
>    ignored_content_type=ignored_content_type(1.0)={application/zip},
>    docid=docid(1.0)={74},
>    type=type(1.0)={5},
>    text=text(1.0)={                  doc2.txt    doc1.txt    }
>    }
> ]
>
> So, the data coming back from Tika when parsing a ZIP file does not include
> the file contents, only the names of the files contained therein.  I've
> tried forcing stream.type=application/zip in the CURL string, but that makes
> no difference.  If I specify an invalid stream.type then I get an exception
> response, so I know it's being used.
>
> When I send one of those txt files individually to the
> ExtractingRequestHandler, I get:
>
> SolrInputDocument[
>    {
>    ignored_meta=ignored_meta(1.0)={
>        [stream_source_info, file, stream_content_type, text/plain,
>         stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
>        },
>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
>    ignored_stream_size=ignored_stream_size(1.0)={30},
>    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
>    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
>    docid=docid(1.0)={74},
>    type=type(1.0)={5},
>    text=text(1.0)={                The quick brown fox  }
>    }
> ]
>
> and we see the file contents in the "text" field.
>
> I'm using the following requestHandler definition in solrconfig.xml:
>
> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
> <requestHandler name="/update/extract"
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> startup="lazy">
> <lst name="defaults">
> <!-- All the main content goes into "text"... if you need to return
>           the extracted text or do highlighting, use a stored field. -->
> <str name="fmap.content">text</str>
> <str name="lowernames">true</str>
> <str name="uprefix">ignored_</str>
>
> <!-- capture link hrefs but ignore div attributes -->
> <str name="captureAttr">true</str>
> <str name="fmap.a">links</str>
> <str name="fmap.div">ignored_</str>
> </lst>
> </requestHandler>
>
> Is there any further debug or diagnostic I can get out of Tika to help me
> work out why it's only returning the file names and not the file contents
> when parsing a ZIP file?
>
>
> Thanks and kind regards,
> Gary.
>
>
>
> On 25/01/2011 16:48, Jayendra Patil wrote:
>
>> Hi Gary,
>>
>> The latest Solr Trunk was able to extract and index the contents of the
>> zip file using the ExtractingRequestHandler.
>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
>> worked pretty well.
>>
>> Tested again with a sample URL and it works fine:
>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
>>
>> You would probably need to drill down to the Tika Jars and
>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
>>
>> Regards,
>> Jayendra
>>
>>
>
