So, this seems to be an issue with Tika and its MIME type detection of plain text. For some discussion of it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files and also https://issues.apache.org/jira/browse/TIKA-154, which has been committed and should be in Tika 0.3.

In the meantime, you can add either the ext.stream.type=text/plain or the ext.resource.name=foo.txt parameter, which gives Tika the type (or a file name to infer it from) explicitly, e.g.:
curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.stream.type=text/plain" -F "myfi...@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><str name="foo.txt">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
        &lt;title/&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;this is some text

here is some more text
&lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;
</str>
</response>

or
curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.resource.name=foo.txt" -F "myfi...@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><str name="foo.txt">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
        &lt;title/&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;this is some text

here is some more text
&lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;
</str>
</response>
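
If you want to script the workaround, here is a minimal sketch of how the two URLs above could be built. The extract_url function and the SOLR_URL variable are hypothetical names of my own; the host, port and parameters are just the ones from the examples above, and the values are not URL-encoded, so this only suits simple values like these.

```shell
#!/bin/sh
# Sketch only: extract_url and SOLR_URL are illustrative names, not part of Solr.
# Host, port and path are taken from the curl examples above.
SOLR_URL="http://localhost:8983/solr/update/extract/"

# extract_url <param> <value>
#   <param> is the type hint to send: ext.stream.type or ext.resource.name
#   <value> is its value, e.g. text/plain or foo.txt (not URL-encoded here)
extract_url() {
  printf '%s?ext.idx.attr=true&ext.extract.only=true&%s=%s\n' "$SOLR_URL" "$1" "$2"
}

# Print the two workaround URLs: explicit MIME type, then file name hint.
extract_url ext.stream.type text/plain
extract_url ext.resource.name foo.txt
```

Either printed URL can be passed to curl in place of the quoted URL in the commands above, with the same -F argument.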


So, I guess the bottom line is that we should file a JIRA issue so we don't lose track of this, and test again with Tika 0.3.


On Feb 10, 2009, at 10:39 AM, Grant Ingersoll wrote:

OK, I have reproduced this. Let me debug for a moment, and then we can likely file a JIRA issue.

On Feb 9, 2009, at 10:17 PM, Erik Hatcher wrote:

One other person has reported this to me off-list, and I just encountered it myself. ExtractingRequestHandler does not handle plain text files properly (no text is extracted). Here's an example:

curl "http://localhost:8982/solr/update/extract?ext.ignore.und.fl=true&wt=ruby&stream.file=/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt&ext.idx.attr=false&ext.literal.file_type_facet=txt&ext.def.fl=text_t&ext.literal.type_s=Resume&ext.literal.id=Resume:20&ext.extract.only=true&ext.resource.name=foo.txt"

{'responseHeader'=>{'status'=>0,'QTime'=>1},'JoeProgrammer.txt'=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
      <title/>
  </head>
  <body/>
</html>
'}

Is this bound to be something simple, like a Tika config change or a missing JAR?

Anyone have the magic incantation?

Thanks,
        Erik

