Solr Cell (ExtractingRequestHandler) and plain text files

One other person has reported this to me off-list, and I just encountered it myself. ExtractingRequestHandler does not handle plain text files properly (no text is extracted). Here's an example:

curl "http://localhost:8982/solr/update/extract?ext.ignore.und.fl=true&wt=ruby&stream.file=/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt&ext.idx.attr=false&ext.literal.file_type_facet=txt&ext.def.fl=text_t&ext.literal.type_s=Resume&ext.literal.id=Resume:20&ext.extract.only=true&ext.resource.name=foo.txt"

{'responseHeader'=>{'status'=>0,'QTime'=>1},'JoeProgrammer.txt'=>'<?xml version="1.0" encoding="UTF-8"?>

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title/>
    </head>
    <body/>
</html>
'}
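For anyone who wants to reproduce this without retyping the long curl line, here is a rough Python sketch that assembles the same URL from the individual parameters. The function name build_extract_url is just something I made up for illustration, and the host, port, and file path are the ones from my example above, so adjust them for your setup. It only builds and prints the URL; it does not send the request. Note that urlencode percent-encodes the "/" and ":" characters in the values, which should decode to the same values on the server side.

```python
from urllib.parse import urlencode

# Endpoint from the curl example above (assumes Solr on localhost:8982).
BASE_URL = "http://localhost:8982/solr/update/extract"

# Parameters copied from the curl example, one per line for readability.
PARAMS = {
    "ext.ignore.und.fl": "true",
    "wt": "ruby",
    "stream.file": "/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt",
    "ext.idx.attr": "false",
    "ext.literal.file_type_facet": "txt",
    "ext.def.fl": "text_t",
    "ext.literal.type_s": "Resume",
    "ext.literal.id": "Resume:20",
    "ext.extract.only": "true",
    "ext.resource.name": "foo.txt",
}


def build_extract_url(base_url=BASE_URL, params=PARAMS):
    """Hypothetical helper: join the base URL and the encoded query string."""
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into curl (in quotes) to repeat the request.
    print(build_extract_url())
```

Having the parameters in a dict makes it easy to flip one at a time (for example ext.resource.name or ext.extract.only) while trying to narrow down what affects the extraction.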

Bound to be something simple with the Tika config, or a missing JAR that needs adding, or something?


Anyone have the magic incantation?

Thanks,
        Erik
