Hi Gary,

I tried the patch on the 3.1 source code (http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/) as well and it worked fine.
The patch is https://issues.apache.org/jira/browse/SOLR-2416, which deals with the Solr Cell module.

You may want to verify the contents in the results by enabling the stored attribute on the text field.

Example URL:

curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated artifact for you to test with.

Regards,
Jayendra

On Fri, May 20, 2011 at 11:15 AM, Gary Taylor <g...@inovem.com> wrote:
> Hello again.  Unfortunately, I'm still getting nowhere with this.  I have
> checked-out the 3.1 source and applied Jayendra's patches (see below) and it
> still appears that the contents of the files in the zipfile are not being
> indexed, only the filenames of those contained files.
>
> I'm using a simple CURL invocation to test this:
>
> curl
> "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"
> -F "commit=true" -F "file=@solr1.zip"
>
> solr1.zip contains two simple txt files (doc1.txt and doc2.txt).  I'm
> expecting the contents of those txt files to be extracted from the zip and
> indexed, but this isn't happening - or at least, I don't get the desired
> result when I do a query afterwards.  I do get a match if I search for
> either "doc1.txt" or "doc2.txt", but not if I search for a word that appears
> in their contents.
>
> If I index one of the txt files (instead of the zipfile), I can query the
> content OK, so I'm assuming my query is sensible and matches the field
> specified on the CURL string (ie. "text").  I'm also happy that the Solr
> Cell content extraction is working because I can successfully index PDF,
> Word, etc. files.
>
> In a fit of desperation I have added log.info statements into the files
> referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those
> in the log when I submit the zipfile with CURL, so I know I'm running those
> patched files in the build.
>
> If anyone can shed any light on what's happening here, I'd be very grateful.
>
> Thanks and kind regards,
> Gary.
>
>
> On 11/04/2011 11:12, Gary Taylor wrote:
>>
>> Jayendra,
>>
>> Thanks for the info - been keeping an eye on this list in case this topic
>> cropped up again.  It's currently a background task for me, so I'll try and
>> take a look at the patches and re-test soon.
>>
>> Joey - glad you brought this issue up again.  I haven't progressed any
>> further with it.  I've not yet moved to Solr 3.1 but it's on my to-do list,
>> as is testing out the patches referenced by Jayendra.  I'll post my findings
>> on this thread - if you manage to test the patches before me, let me know
>> how you get on.
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>> On 11/04/2011 05:02, Jayendra Patil wrote:
>>>
>>> The migration of Tika to the latest 0.8 version seems to have
>>> reintroduced the issue.
>>>
>>> I was able to get this working again with the following patches. (Solr
>>> Cell and Data Import handler)
>>>
>>> https://issues.apache.org/jira/browse/SOLR-2416
>>> https://issues.apache.org/jira/browse/SOLR-2332
>>>
>>> You can try these.
>>>
>>> Regards,
>>> Jayendra
>>>
>>> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:
>>>>
>>>> Hi Gary,
>>>>
>>>> I have been experiencing the same problem... Unable to extract content
>>>> from archive file formats.  I just tried again with a clean install of
>>>> Solr 3.1.0 (using Tika 0.8) and continue to experience the same results.
>>>> Did you have any success with this problem with Solr 1.4.1 or 3.1.0 ?
>>>>
>>>> I'm using this curl command to send data to Solr.
>>>> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true"
>>>> -H "application/octet-stream" -F "myfile=@data.zip"
>>>>
>>>> No problem extracting single rich text documents, but archive files only
>>>> result in the file names within the archive being indexed.  Am I missing
>>>> something else in my configuration?  Solr doesn't seem to be unpacking
>>>> the archive files.  Based on the email chain associated with your first
>>>> message, some people have been able to get this functionality to work
>>>> as desired.
>>>>
>>>
>>
>>
>
>
> --
> Gary Taylor
> INOVEM
>
> Tel +44 (0)1488 648 480
> Fax +44 (0)7092 115 933
> gary.tay...@inovem.com
> www.inovem.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
>
>
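
For anyone repeating the zip test described in this thread, here is a rough Python sketch of the moving parts. It only builds a small test archive (two txt files, like solr1.zip) and assembles the extract and query URLs; it does not contact Solr. The base URL, core name and parameters are copied from Gary's curl command. The helper names are made up for illustration, and passing commit=true in the query string (rather than as a form field) and querying the "text" field through /select are assumptions about the setup.

```python
import io
import zipfile
from urllib.parse import urlencode

# Base URL and core name taken from Gary's curl command; adjust as needed.
SOLR_BASE = "http://localhost:8983/solr/core0"


def build_test_zip(files):
    """Return the bytes of a zip archive holding the given {name: text} files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def extract_url(docid):
    """URL for the extract handler, using the parameters from the thread."""
    params = {
        "literal.docid": docid,
        "fmap.content": "text",   # map extracted content to the "text" field
        "literal.type": 5,
        "commit": "true",         # assumption: sent in the query string
    }
    return SOLR_BASE + "/update/extract?" + urlencode(params)


def query_url(term):
    """URL that searches the "text" field for a word from the file contents."""
    return SOLR_BASE + "/select?" + urlencode({"q": "text:" + term})


if __name__ == "__main__":
    data = build_test_zip({"doc1.txt": "alpha", "doc2.txt": "beta"})
    print(len(data), "bytes in test archive")
    print(extract_url(74))
    print(query_url("alpha"))
```

If the archive contents are being extracted, a query for a word that appears only inside doc1.txt or doc2.txt should return the document, not just a query for the file names.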