Awesome, thanks Jayendra. I hadn't caught these patches yet. I applied the SOLR-2416 patch to the solr-3.1 release tag, and that resolved the problem of archive files not being unpacked and indexed with Solr Cell. Thanks for the FYI.

https://issues.apache.org/jira/browse/SOLR-2416
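For anyone wanting to re-check this after patching, here is a minimal Python sketch (standard library only; it does not talk to Solr or Tika). It builds a two-file test ZIP like the one Gary describes further down the thread, plus an /update/extract URL with the same parameters as my curl command. The function names and the doc2.txt text are made up for illustration; only the doc1.txt text comes from Gary's log.

```python
import io
import zipfile
from urllib.parse import urlencode


def build_test_zip(files):
    """Return the bytes of a ZIP archive holding the given {name: text} entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def extract_url(base, doc_id, content_field):
    """Build an /update/extract URL with the parameters used in the thread's curl command."""
    params = {"literal.id": doc_id, "fmap.content": content_field, "commit": "true"}
    return base.rstrip("/") + "/update/extract?" + urlencode(params)


def entry_names_and_text(zip_bytes):
    """Map each entry name to its decoded text: what a working extraction should surface."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


if __name__ == "__main__":
    # The doc2.txt content is a placeholder, not taken from the thread.
    data = build_test_zip({
        "doc1.txt": "The quick brown fox",
        "doc2.txt": "placeholder text for the second file",
    })
    print(extract_url("http://localhost:8080/solr", "doc1", "attr_content"))
    for name, text in entry_names_and_text(data).items():
        print(name, "->", text)
```

If the indexed content field holds only the entry names (doc1.txt, doc2.txt) and none of the text, the archive is still not being unpacked.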

On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil <jayendra.patil....@gmail.com> wrote:

> The migration of Tika to the latest 0.8 version seems to have
> reintroduced the issue.
>
> I was able to get this working again with the following patches. (Solr
> Cell and Data Import handler)
>
> https://issues.apache.org/jira/browse/SOLR-2416
> https://issues.apache.org/jira/browse/SOLR-2332
>
> You can try these.
>
> Regards,
> Jayendra
>
> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:
> > Hi Gary,
> >
> > I have been experiencing the same problem... Unable to extract content from
> > archive file formats. I just tried again with a clean install of Solr 3.1.0
> > (using Tika 0.8) and continue to experience the same results. Did you have
> > any success with this problem with Solr 1.4.1 or 3.1.0 ?
> >
> > I'm using this curl command to send data to Solr.
> > curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true"
> > -H "application/octet-stream" -F "myfile=@data.zip"
> >
> > No problem extracting single rich text documents, but archive files only
> > result in the file names within the archive being indexed. Am I missing
> > something else in my configuration? Solr doesn't seem to be unpacking the
> > archive files. Based on the email chain associated with your first message,
> > some people have been able to get this functionality to work as desired.
> >
> > On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <g...@inovem.com> wrote:
> >
> >> Can anyone shed any light on this, and whether it could be a config issue?
> >> I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
> >>
> >> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
> >> the ExtractingRequestHandler, I get the following log entry (formatted for
> >> ease of reading):
> >>
> >> SolrInputDocument[
> >>   {
> >>     ignored_meta=ignored_meta(1.0)={
> >>       [stream_source_info, file, stream_content_type,
> >>       application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> >>       Content-Type, application/zip]
> >>     },
> >>     ignored_=ignored_(1.0)={
> >>       [package-entry, package-entry]
> >>     },
> >>     ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>     ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
> >>     ignored_stream_size=ignored_stream_size(1.0)={260},
> >>     ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
> >>     ignored_content_type=ignored_content_type(1.0)={application/zip},
> >>     docid=docid(1.0)={74},
> >>     type=type(1.0)={5},
> >>     text=text(1.0)={ doc2.txt doc1.txt }
> >>   }
> >> ]
> >>
> >> So, the data coming back from Tika when parsing a ZIP file does not include
> >> the file contents, only the names of the files contained therein. I've
> >> tried forcing stream.type=application/zip in the CURL string, but that makes
> >> no difference. If I specify an invalid stream.type then I get an exception
> >> response, so I know it's being used.
> >>
> >> When I send one of those txt files individually to the
> >> ExtractingRequestHandler, I get:
> >>
> >> SolrInputDocument[
> >>   {
> >>     ignored_meta=ignored_meta(1.0)={
> >>       [stream_source_info, file, stream_content_type, text/plain,
> >>       stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
> >>     },
> >>     ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>     ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
> >>     ignored_stream_size=ignored_stream_size(1.0)={30},
> >>     ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
> >>     ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
> >>     docid=docid(1.0)={74},
> >>     type=type(1.0)={5},
> >>     text=text(1.0)={ The quick brown fox }
> >>   }
> >> ]
> >>
> >> and we see the file contents in the "text" field.
> >>
> >> I'm using the following requestHandler definition in solrconfig.xml:
> >>
> >> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler-->
> >> <requestHandler name="/update/extract"
> >>     class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> >>     startup="lazy">
> >>   <lst name="defaults">
> >>     <!-- All the main content goes into "text"... if you need to return
> >>          the extracted text or do highlighting, use a stored field. -->
> >>     <str name="fmap.content">text</str>
> >>     <str name="lowernames">true</str>
> >>     <str name="uprefix">ignored_</str>
> >>
> >>     <!-- capture link hrefs but ignore div attributes -->
> >>     <str name="captureAttr">true</str>
> >>     <str name="fmap.a">links</str>
> >>     <str name="fmap.div">ignored_</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >> Is there any further debug or diagnostic I can get out of Tika to help me
> >> work out why it's only returning the file names and not the file contents
> >> when parsing a ZIP file?
> >>
> >> Thanks and kind regards,
> >> Gary.
> >>
> >> On 25/01/2011 16:48, Jayendra Patil wrote:
> >>
> >>> Hi Gary,
> >>>
> >>> The latest Solr Trunk was able to extract and index the contents of the zip
> >>> file using the ExtractingRequestHandler.
> >>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
> >>> worked pretty well.
> >>>
> >>> Tested again with sample url and works fine -
> >>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
> >>>
> >>> You would probably need to drill down to the Tika Jars and
> >>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
> >>>
> >>> Regards,
> >>> Jayendra