Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-23 Thread Gary Taylor
Jayendra, I cleared out my local repository, and replayed all of my steps from Friday and now it works. The only difference (or the only one that's obvious to me) was that I applied the patch before doing a full compile/test/dist. But I assumed that given I was seeing my new log entries

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Jayendra Patil
Hi Gary, I tried the patch on the 3.1 source code (@ http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/) as well and it worked fine. @Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals with the Solr Cell module. You may want to verify the contents from the

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Gary Taylor
Hello again. Unfortunately, I'm still getting nowhere with this. I have checked out the 3.1 source and applied Jayendra's patches (see below) and it still appears that the contents of the files in the zipfile are not being indexed, only the filenames of those contained files. I'm using a sim

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Joey Hanzel
Awesome. Thanks Jayendra. I hadn't caught these patches yet. I applied SOLR-2416 patch to the solr-3.1 release tag. This resolved the problem of archive files not being unpacked and indexed with Solr CELL. Thanks for the FYI. https://issues.apache.org/jira/browse/SOLR-2416 On Mon, Apr 11, 2011 a

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Gary Taylor
Jayendra, Thanks for the info - been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try and take a look at the patches and re-test soon. Joey - glad you brought this issue up again. I haven't progressed any further with it.

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-10 Thread Jayendra Patil
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches. (Solr Cell and Data Import handler) https://issues.apache.org/jira/browse/SOLR-2416 https://issues.apache.org/jira/browse/SOLR-2332 You can try t

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-10 Thread Joey Hanzel
Hi Gary, I have been experiencing the same problem... Unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this problem with Solr 1.4.1 or 3.1.0? I'

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-31 Thread Gary Taylor
Can anyone shed any light on this, and whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars. When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Jayendra Patil
Hi Gary, The latest Solr Trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and worked pretty well. Tested again with sample url and works fine - curl "http://localhost:8080/solr

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor
OK, got past the schema.xml problem, but now I'm back to square one. I can index the contents of binary files (Word, PDF etc...), as well as text files, but it won't index the content of files inside a zip. As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of the

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor
Thanks Erlend. Not used SVN before, but have managed to download and build latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of t

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen
On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are succes

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen
There seems to be a bug with the current 1.4.1 release. You cannot extract any content at all, regardless of content type. Try to get a fresh version from the SVN repository. I did that earlier today and can verify that Tika now will extract the content. I'm not sure about zip files. Tika