So, this seems to be an issue with Tika and its MIME type detection
of plain text. For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files
and also https://issues.apache.org/jira/browse/TIKA-154, which has
been committed and should be in 0.3.
In the meantime, you can work around it by adding either
ext.stream.type=text/plain or ext.resource.name=foo.txt to the request, e.g.:
curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.stream.type=text/plain" -F "myfi...@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int></lst><str name="foo.txt"><?xml version="1.0"
encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>this is some text
here is some more text
</p>
</body>
</html>
</str>
</response>
or
curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.resource.name=foo.txt" -F "myfi...@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int></lst><str name="foo.txt"><?xml version="1.0"
encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>this is some text
here is some more text
</p>
</body>
</html>
</str>
</response>
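The two requests above differ only in the last query parameter, so here is a rough shell sketch that builds either URL. Note that SOLR_EXTRACT_URL and extract_url are names I made up for illustration, not anything Solr or Tika provides, and the parameters are simply the ones from the curl commands above.

```shell
#!/bin/sh
# Sketch only: builds the extract-only URL with one of the two type hints.
# SOLR_EXTRACT_URL and extract_url are hypothetical names, not part of Solr.
SOLR_EXTRACT_URL="http://localhost:8983/solr/update/extract/"

# extract_url HINT_KIND HINT_VALUE
#   HINT_KIND "type" -> ext.stream.type=HINT_VALUE   (explicit MIME type)
#   HINT_KIND "name" -> ext.resource.name=HINT_VALUE (file name hint)
extract_url() {
  case "$1" in
    type) hint="ext.stream.type=$2" ;;
    name) hint="ext.resource.name=$2" ;;
    *) echo "usage: extract_url type|name VALUE" >&2; return 1 ;;
  esac
  printf '%s?ext.idx.attr=true&ext.extract.only=true&%s\n' \
    "$SOLR_EXTRACT_URL" "$hint"
}

# Print both variants of the workaround URL.
extract_url type text/plain
extract_url name foo.txt
```

Either printed URL can be passed to curl in quotes, together with the same -F upload argument used in the commands above.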
So, I guess the bottom line is that we should file a JIRA so we don't
lose track of it, and then test again with Tika 0.3.
On Feb 10, 2009, at 10:39 AM, Grant Ingersoll wrote:
OK, I have reproduced this. Let me debug for a moment and then we
can likely file a JIRA.
On Feb 9, 2009, at 10:17 PM, Erik Hatcher wrote:
One other person has reported this to me off-list, and I just
encountered it myself. ExtractingRequestHandler does not handle
plain text files properly (no text is extracted). Here's an example:
curl "http://localhost:8982/solr/update/extract?ext.ignore.und.fl=true&wt=ruby&stream.file=/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt&ext.idx.attr=false&ext.literal.file_type_facet=txt&ext.def.fl=text_t&ext.literal.type_s=Resume&ext.literal.id=Resume:20&ext.extract.only=true&ext.resource.name=foo.txt"
{'responseHeader'=>{'status'=>0,'QTime'=>1},'JoeProgrammer.txt'=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body/>
</html>
'}
Is this bound to be something simple, like a Tika config setting or
a missing JAR?
Anyone have the magic incantation?
Thanks,
Erik