Re: Solr Cell (ExtractingRequestHandler) and plain text files

Grant Ingersoll Tue, 10 Feb 2009 07:58:08 -0800

So, this seems to be an issue with Tika and its MIME type detection of plain text. For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files and also https://issues.apache.org/jira/browse/TIKA-154, which has been committed and should be in 0.3.

In the meantime, you can add either ext.stream.type=text/plain or ext.resource.name=foo.txt, i.e.:

curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.stream.type=text/plain" -F "myfi...@foo.txt"

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><str name="foo.txt"><?xml version="1.0" encoding="UTF-8"?>

&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
        &lt;title/&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;this is some text


here is some more text
&lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;
</str>
</response>

or

curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.resource.name=foo.txt" -F "myfi...@foo.txt"

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><str name="foo.txt"><?xml version="1.0" encoding="UTF-8"?>

&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
        &lt;title/&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;this is some text

here is some more text
&lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;
</str>
</response>
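
The two requests above differ only in which hint parameter is sent. As a rough sketch (not part of Solr or Tika), the same URLs can be assembled in Python; build_extract_url is a hypothetical helper name, and the base URL is simply the one used in the curl examples:

```python
from urllib.parse import urlencode

# Hypothetical helper for illustration only; it is not part of Solr.
# It builds an extract-only URL carrying one (or both) of the two hints.
def build_extract_url(base_url, stream_type=None, resource_name=None):
    params = [("ext.idx.attr", "true"), ("ext.extract.only", "true")]
    if stream_type is not None:
        # Tell the handler the content type explicitly.
        params.append(("ext.stream.type", stream_type))
    if resource_name is not None:
        # Or give it a file name so detection can use the extension.
        params.append(("ext.resource.name", resource_name))
    # safe="/" keeps "text/plain" readable instead of "text%2Fplain".
    return base_url + "?" + urlencode(params, safe="/")

BASE = "http://localhost:8983/solr/update/extract/"
by_type = build_extract_url(BASE, stream_type="text/plain")
by_name = build_extract_url(BASE, resource_name="foo.txt")
print(by_type)
print(by_name)
```

The sketch only builds the URL strings; posting the file is still left to curl or an HTTP client of your choice.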

So, I guess the bottom line is that we should file a JIRA so we don't lose track of it, and test with 0.3.
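
For retesting, the symptom to look for is an empty body element in the extracted XHTML, as in the original report. A minimal sketch of such a check in Python, assuming the XHTML has already been pulled out of the Solr response as a string; extracted_text and the two sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical helper: returns the text found under <body>, or "" if none.
def extracted_text(xhtml):
    # Parse from bytes so the encoding declaration is handled by the parser.
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(XHTML_NS + "body")
    if body is None:
        return ""
    return "".join(body.itertext()).strip()

# Illustrative samples modelled on the outputs in this thread.
EMPTY = ('<?xml version="1.0" encoding="UTF-8"?>\n'
         '<html xmlns="http://www.w3.org/1999/xhtml">'
         '<head><title/></head><body/></html>')
GOOD = ('<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        '<head><title/></head>'
        '<body><p>this is some text\n\nhere is some more text\n</p></body></html>')

print(repr(extracted_text(EMPTY)))
print(repr(extracted_text(GOOD)))
```

An empty string for a non-empty input file would indicate the problem is still present.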



On Feb 10, 2009, at 10:39 AM, Grant Ingersoll wrote:

OK, I have reproduced this. Let me debug for a moment and then we can likely file a JIRA.

On Feb 9, 2009, at 10:17 PM, Erik Hatcher wrote:

One other person has reported this to me off-list, and I just encountered it myself. ExtractingRequestHandler does not handle plain text files properly (no text is extracted). Here's an example:

curl "http://localhost:8982/solr/update/extract?ext.ignore.und.fl=true&wt=ruby&stream.file=/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt&ext.idx.attr=false&ext.literal.file_type_facet=txt&ext.def.fl=text_t&ext.literal.type_s=Resume&ext.literal.id=Resume:20&ext.extract.only=true&ext.resource.name=foo.txt"
{'responseHeader'=>{'status'=>0,'QTime'=>1},'JoeProgrammer.txt'=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
      <title/>
  </head>
  <body/>
</html>
'}
Bound to be something simple with Tika config or to add a missing JAR or something?
Anyone have the magic incantation?

Thanks,
        Erik
