The TikaEntityProcessor is the class in the DIH that calls the Tika
libraries.
TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in the trunk and
the 3.x branch.
I set it up from the 3.x branch and found that the "DefaultParser"
does not work: you have to explicitly name the parser for the file
format you want to use. See
https://issues.apache.org/jira/browse/SOLR-2116
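For anyone looking for the syntax, here is a minimal data-config.xml
sketch of naming the parser explicitly. It is an illustration, not a
tested configuration: the file path, entity name, and field names are
hypothetical, the PDF parser is just one example of a Tika parser
class, and the "parser" attribute is used as I understand it from the
TikaEntityProcessor documentation.

```xml
<dataConfig>
  <!-- BinFileDataSource hands the raw file bytes to Tika -->
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- url is a hypothetical path; parser names the Tika parser class
         explicitly instead of relying on the default -->
    <entity name="tika-pdf" processor="TikaEntityProcessor"
            url="/path/to/sample.pdf" format="text"
            parser="org.apache.tika.parser.pdf.PDFParser">
      <!-- meta="true" pulls the value from the Tika metadata -->
      <field column="title" name="title" meta="true"/>
      <!-- the "text" column carries the extracted body -->
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>
```

The target fields ("title", "text") have to exist in your schema.xml,
and a different file format needs the matching Tika parser class.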
Tod wrote:
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
Hi,
I have exactly the same problem as the one you described in this
thread:
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html
and I would like to ask whether you found a solution.
I started looking at Tika and DataImportHandler, but I have not
managed to find the right syntax.
If you did manage to find it, could you please give an example?
Thanks.
Bumping this to the list...
Unfortunately I could never get DIH to work correctly. My suspicion
is that I was using a stock 1.4.0 Solr but attempting to perform a
task that was only available on the latest build. My customer
requirements demand a pretty well vetted GA release so experimenting
was not an option. I attempted an upgrade (quickly, sloppily) to
1.4.1 but no luck. I believe the next GA release might be my solution.
I tried to get around that obstacle with SolrJ's
ContentStreamUpdateRequest; see
http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927
After floundering for a while I decided to put that on hold. I ended
up writing a Perl script that emulates the command line cURL that I
referenced in the above thread. It took about 72 hours to index
~850,000 entries (if anyone is interested).
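For readers who want to try the same approach, here is a rough Python
sketch of posting one file to Solr's extracting handler, the way a
command-line cURL call would. It is not the Perl script described
above. The endpoint URL, the literal.id parameter, and the function
names are assumptions for illustration, so adjust them to your own
setup.

```python
import mimetypes
import urllib.parse
import urllib.request

# Assumed endpoint: the usual ExtractingRequestHandler path on a local Solr.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_request(path, doc_id, base_url=SOLR_EXTRACT_URL, commit=False):
    """Build (but do not send) a POST that streams one file to Solr."""
    # literal.id sets the unique key of the resulting document.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    url = base_url + "?" + urllib.parse.urlencode(params)

    # Guess the MIME type from the file name; fall back to a generic type.
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"

    with open(path, "rb") as handle:
        body = handle.read()

    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_file(path, doc_id, **kwargs):
    """Send the file to Solr and return the HTTP status code."""
    request = build_extract_request(path, doc_id, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling index_file("report.pdf", "doc1") sends one document without
committing. For a large batch it is usually better to leave commit
off for each file and commit once at the end.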
I plan to loop back and try the suggestions Hoss last made; I just
haven't had the time to respond. I'm sure things will work. I just
needed something quickly and don't have the seasoned experience the
other developers do.
- Tod