The TikaEntityProcessor is the class in the DIH that calls the Tika libraries. TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in the trunk and the 3.x branch.

I have set it up from the 3.x branch. I discovered that the "DefaultParser" does not work, and you have to explicitly name the parser for the file format you want to use.

https://issues.apache.org/jira/browse/SOLR-2116

Tod wrote:
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
Hi,
I have exactly the same problem as the one you described at this link: http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html and I would like to ask whether you found a solution. I have started looking at Tika and DataImportHandler, but I have not managed to find the right syntax. Could you please give an example if you succeeded in finding it?
Thanks.

Bumping this to the list...

Unfortunately I could never get DIH to work correctly. My suspicion is that I was using a stock 1.4.0 Solr but attempting to perform a task that was only available on the latest build. My customer requirements demand a pretty well vetted GA release so experimenting was not an option. I attempted an upgrade (quickly, sloppily) to 1.4.1 but no luck. I believe the next GA release might be my solution.

I tried getting around that bump by trying SolrJ ContentStreamUpdateRequest @ http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927. After floundering for a while I decided to put that on hold. I ended up writing a Perl script that emulates the command line cURL that I referenced in the above thread. It took about 72 hours to index ~850,000 entries (if anyone is interested).
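For anyone curious what the cURL-style approach looks like as a script, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes a local Solr at http://localhost:8983/solr with the ExtractingRequestHandler mapped to /update/extract, and a unique key field named "id"; the function names are made up for illustration. It sends the raw file as the request body, which may differ from how the original cURL command posted it.

```python
import urllib.parse
import urllib.request

# Assumption: a local Solr whose solrconfig.xml maps the
# ExtractingRequestHandler to /update/extract.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, commit=False, base_url=SOLR_EXTRACT_URL):
    """Build the extract URL; literal.id sets the document's id field."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urllib.parse.urlencode(params)


def post_file(path, doc_id, commit=False, base_url=SOLR_EXTRACT_URL):
    """POST one file's raw bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        build_extract_url(doc_id, commit, base_url),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Committing once at the end of a batch (commit=True only on the last file) rather than after every document is usually much cheaper for large imports.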

I plan on looping back to try the suggestions Hoss last made; I just haven't had the time to respond. I'm sure things will work. I just needed something quickly, and I don't have the seasoned experience the other developers do.


- Tod
