Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Joey Hanzel Mon, 11 Apr 2011 18:25:02 -0700

Awesome. Thanks Jayendra.  I hadn't caught these patches yet.

I applied the SOLR-2416 patch to the solr-3.1 release tag. This resolved the
problem of archive files not being unpacked and indexed with Solr Cell.
Thanks for the FYI.
https://issues.apache.org/jira/browse/SOLR-2416
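
For anyone wanting to reproduce the symptom, here is a minimal sketch of a test archive like the one Gary describes below (two text entries, doc1.txt and doc2.txt). It only illustrates the difference between listing entry names, which is all that ended up in the "text" field before the patch, and reading entry contents, which is what should be indexed. It uses Python's standard zipfile module and says nothing about how Tika or the SOLR-2416 patch work internally. The helper names and the doc2.txt text are assumptions made up for illustration.

```python
import io
import zipfile


def build_sample_zip():
    """Return the bytes of a small ZIP with two text entries.

    The doc1.txt text matches the log output quoted in this thread;
    the doc2.txt text is an assumption for illustration.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("doc1.txt", "The quick brown fox")
        archive.writestr("doc2.txt", "jumps over the lazy dog")
    return buffer.getvalue()


def entry_names(zip_bytes):
    """List only the entry names (the names-only behaviour)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


def entry_texts(zip_bytes):
    """Map each entry name to its decoded text (the desired behaviour)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
        }


if __name__ == "__main__":
    data = build_sample_zip()
    print(entry_names(data))
    print(entry_texts(data))
```

Writing the bytes from build_sample_zip() to a file gives a small archive that can be posted to /update/extract to check whether the entry contents, and not just the entry names, are indexed.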


On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil <
jayendra.patil....@gmail.com> wrote:

> The migration of Tika to the latest 0.8 version seems to have
> reintroduced the issue.
>
> I was able to get this working again with the following patches. (Solr
> Cell and Data Import handler)
>
> https://issues.apache.org/jira/browse/SOLR-2416
> https://issues.apache.org/jira/browse/SOLR-2332
>
> You can try these.
>
> Regards,
> Jayendra
>
> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com>
> wrote:
> > Hi Gary,
> >
> > I have been experiencing the same problem... Unable to extract content
> > from archive file formats.  I just tried again with a clean install of Solr
> > 3.1.0 (using Tika 0.8) and continue to experience the same results.  Did you
> > have any success with this problem with Solr 1.4.1 or 3.1.0?
> >
> > I'm using this curl command to send data to Solr.
> > curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
> >   -H "application/octet-stream" -F "myfile=@data.zip"
> >
> > No problem extracting single rich text documents, but archive files only
> > result in the file names within the archive being indexed. Am I missing
> > something else in my configuration? Solr doesn't seem to be unpacking the
> > archive files. Based on the email chain associated with your first message,
> > some people have been able to get this functionality to work as desired.
> >
> > On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <g...@inovem.com> wrote:
> >
> >> Can anyone shed any light on this, and whether it could be a config issue?
> >> I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
> >>
> >> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
> >> the ExtractingRequestHandler, I get the following log entry (formatted for
> >> ease of reading):
> >>
> >> SolrInputDocument[
> >>    {
> >>    ignored_meta=ignored_meta(1.0)={
> >>        [stream_source_info, file, stream_content_type,
> >> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> >> Content-Type, application/zip]
> >>        },
> >>    ignored_=ignored_(1.0)={
> >>        [package-entry, package-entry]
> >>        },
> >>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
> >>    ignored_stream_size=ignored_stream_size(1.0)={260},
> >>    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
> >>    ignored_content_type=ignored_content_type(1.0)={application/zip},
> >>    docid=docid(1.0)={74},
> >>    type=type(1.0)={5},
> >>    text=text(1.0)={                  doc2.txt    doc1.txt    }
> >>    }
> >> ]
> >>
> >> So, the data coming back from Tika when parsing a ZIP file does not include
> >> the file contents, only the names of the files contained therein.  I've
> >> tried forcing stream.type=application/zip in the CURL string, but that makes
> >> no difference.  If I specify an invalid stream.type then I get an exception
> >> response, so I know it's being used.
> >>
> >> When I send one of those txt files individually to the
> >> ExtractingRequestHandler, I get:
> >>
> >> SolrInputDocument[
> >>    {
> >>    ignored_meta=ignored_meta(1.0)={
> >>        [stream_source_info, file, stream_content_type, text/plain,
> >> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
> >>        },
> >>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
> >>    ignored_stream_size=ignored_stream_size(1.0)={30},
> >>    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
> >>    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
> >>    docid=docid(1.0)={74},
> >>    type=type(1.0)={5},
> >>    text=text(1.0)={                The quick brown fox  }
> >>    }
> >> ]
> >>
> >> and we see the file contents in the "text" field.
> >>
> >> I'm using the following requestHandler definition in solrconfig.xml:
> >>
> >> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
> >> <requestHandler name="/update/extract"
> >>     class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> >>     startup="lazy">
> >>   <lst name="defaults">
> >>     <!-- All the main content goes into "text"... if you need to return
> >>          the extracted text or do highlighting, use a stored field. -->
> >>     <str name="fmap.content">text</str>
> >>     <str name="lowernames">true</str>
> >>     <str name="uprefix">ignored_</str>
> >>
> >>     <!-- capture link hrefs but ignore div attributes -->
> >>     <str name="captureAttr">true</str>
> >>     <str name="fmap.a">links</str>
> >>     <str name="fmap.div">ignored_</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >> Is there any further debug or diagnostic I can get out of Tika to help me
> >> work out why it's only returning the file names and not the file contents
> >> when parsing a ZIP file?
> >>
> >>
> >> Thanks and kind regards,
> >> Gary.
> >>
> >>
> >>
> >> On 25/01/2011 16:48, Jayendra Patil wrote:
> >>
> >>> Hi Gary,
> >>>
> >>> The latest Solr Trunk was able to extract and index the contents of the
> >>> zip file using the ExtractingRequestHandler.
> >>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
> >>> worked pretty well.
> >>>
> >>> Tested again with sample url and works fine -
> >>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
> >>>
> >>> You would probably need to drill down to the Tika Jars and
> >>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
> >>>
> >>> Regards,
> >>> Jayendra
> >>>
> >>>
> >>
> >
>
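
Several of the curl commands quoted above were broken across lines by mail wrapping, so it may be easier to assemble the /update/extract URL programmatically. The following is a rough sketch using only Python's standard library. The function name extract_url and its arguments are hypothetical; the parameter names (literal.id, fmap.content, commit) are the ones used in the commands in this thread.

```python
from urllib.parse import urlencode


def extract_url(base, doc_id, extra=None):
    """Build a Solr /update/extract URL.

    base   -- Solr base URL, e.g. "http://localhost:8080/solr" (assumed)
    doc_id -- value sent as literal.id
    extra  -- optional dict of further parameters, e.g. fmap.content
    """
    query = {"literal.id": doc_id}
    query.update(extra or {})
    query["commit"] = "true"  # commit immediately, as in the thread
    # urlencode percent-escapes values, so the result survives shell quoting
    return base + "/update/extract?" + urlencode(query)


if __name__ == "__main__":
    print(extract_url(
        "http://localhost:8080/solr",
        "doc1",
        {"fmap.content": "attr_content"},
    ))
```

The printed URL can be passed to curl in double quotes on a single line, which avoids the stray line breaks seen in the quoted commands.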
