Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

Jayendra Patil Fri, 20 May 2011 19:13:07 -0700

Hi Gary,

I tried the patch on the 3.1 source code (@
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
as well and it worked fine.
@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
with the Solr Cell module.


You may want to verify the contents from the results by enabling the
stored attribute on the text field.

e.g.
curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated
artifact so you can test with it.
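
For the verification step, here is a minimal sketch of a query that
fetches the stored fields for the test document. It assumes the default
example URL, a unique key field named id, a field named text declared
with stored="true" in schema.xml, and the id from the curl example
above; adjust these to your setup.

```shell
# Sketch only: SOLR_URL, DOC_ID, and the field name "text" are assumptions.
SOLR_URL="http://localhost:8983/solr"
DOC_ID="777045"

# Build a select query that returns only the id and text fields
# of the document that was just indexed.
QUERY_URL="${SOLR_URL}/select?q=id:${DOC_ID}&fl=id,text&wt=json"
echo "$QUERY_URL"

# Uncomment to run the query against a live Solr instance:
# curl "$QUERY_URL"
```

If the text field in the response holds only the entry file names, the
archive contents were not extracted.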

Regards,
Jayendra

On Fri, May 20, 2011 at 11:15 AM, Gary Taylor <g...@inovem.com> wrote:
> Hello again.  Unfortunately, I'm still getting nowhere with this.  I have
> checked out the 3.1 source and applied Jayendra's patches (see below) and it
> still appears that the contents of the files in the zipfile are not being
> indexed, only the filenames of those contained files.
>
> I'm using a simple CURL invocation to test this:
>
> curl
> "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"
> -F "commit=true" -F "file=@solr1.zip"
>
> solr1.zip contains two simple txt files (doc1.txt and doc2.txt).  I'm
> expecting the contents of those txt files to be extracted from the zip and
> indexed, but this isn't happening - or at least, I don't get the desired
> result when I do a query afterwards.  I do get a match if I search for
> either "doc1.txt" or "doc2.txt", but not if I search for a word that appears
> in their contents.
>
> If I index one of the txt files (instead of the zipfile), I can query the
> content OK, so I'm assuming my query is sensible and matches the field
> specified on the CURL string (i.e. "text").  I'm also happy that the Solr
> Cell content extraction is working because I can successfully index PDF,
> Word, etc. files.
>
> In a fit of desperation I have added log.info statements into the files
> referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those
> in the log when I submit the zipfile with CURL, so I know I'm running those
> patched files in the build.
>
> If anyone can shed any light on what's happening here, I'd be very grateful.
>
> Thanks and kind regards,
> Gary.
>
>
> On 11/04/2011 11:12, Gary Taylor wrote:
>>
>> Jayendra,
>>
>> Thanks for the info - been keeping an eye on this list in case this topic
>> cropped up again.  It's currently a background task for me, so I'll try and
>> take a look at the patches and re-test soon.
>>
>> Joey - glad you brought this issue up again.  I haven't progressed any
>> further with it.  I've not yet moved to Solr 3.1 but it's on my to-do list,
>> as is testing out the patches referenced by Jayendra.  I'll post my findings
>> on this thread - if you manage to test the patches before me, let me know
>> how you get on.
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>> On 11/04/2011 05:02, Jayendra Patil wrote:
>>>
>>> The migration of Tika to the latest 0.8 version seems to have
>>> reintroduced the issue.
>>>
>>> I was able to get this working again with the following patches. (Solr
>>> Cell and Data Import handler)
>>>
>>> https://issues.apache.org/jira/browse/SOLR-2416
>>> https://issues.apache.org/jira/browse/SOLR-2332
>>>
>>> You can try these.
>>>
>>> Regards,
>>> Jayendra
>>>
>>> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel<phan...@nearinfinity.com>
>>>  wrote:
>>>>
>>>> Hi Gary,
>>>>
>>>> I have been experiencing the same problem... Unable to extract content from
>>>> archive file formats.  I just tried again with a clean install of Solr 3.1.0
>>>> (using Tika 0.8) and continue to experience the same results.  Did you have
>>>> any success with this problem with Solr 1.4.1 or 3.1.0?
>>>>
>>>> I'm using this curl command to send data to Solr.
>>>> curl
>>>> "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true"
>>>> -H "application/octet-stream" -F  "myfile=@data.zip"
>>>>
>>>> No problem extracting single rich text documents, but archive files only
>>>> result in the file names within the archive being indexed. Am I missing
>>>> something else in my configuration? Solr doesn't seem to be unpacking the
>>>> archive files. Based on the email chain associated with your first message,
>>>> some people have been able to get this functionality to work as desired.
>>>>
>>>
>>
>>
>
>
> --
> Gary Taylor
> INOVEM
>
> Tel +44 (0)1488 648 480
> Fax +44 (0)7092 115 933
> gary.tay...@inovem.com
> www.inovem.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
>
>
