Re: Data Import Handler Rich Format Documents

Lance Norskog Fri, 24 Sep 2010 18:20:30 -0700

The TikaEntityProcessor is the class in the DIH that calls the Tika libraries. TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in the trunk and the 3.x branch.

I have set it up from the 3.x branch. I discovered that the "DefaultParser" does not work, and you have to explicitly name the parser for the file format you want to use.


https://issues.apache.org/jira/browse/SOLR-2116
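
For anyone looking for the syntax asked about in this thread, here is a minimal data-config.xml sketch of a TikaEntityProcessor entity with the parser named explicitly. It is only an illustration under assumptions, not the configuration actually used above: the file path, the entity name, and the Solr field names are hypothetical, and the PDF parser class is picked just as an example of "the parser for the file format you want to use". It needs a build that includes TikaEntityProcessor (trunk or 3.x, not 1.4 or 1.4.1).

```xml
<dataConfig>
  <!-- Binary data source, so the raw file bytes are handed to Tika -->
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- url: hypothetical path to one PDF file.
         parser: explicit Tika parser class for that format, named here
         instead of relying on the default parser. -->
    <entity name="tika-pdf"
            processor="TikaEntityProcessor"
            url="/path/to/example.pdf"
            format="text"
            parser="org.apache.tika.parser.pdf.PDFParser">
      <!-- meta="true" reads the value from Tika metadata, not the body -->
      <field column="Author" name="author" meta="true" />
      <field column="title" name="title" meta="true" />
      <!-- "text" is the extracted body; the target field name is assumed -->
      <field column="text" name="text" />
    </entity>
  </document>
</dataConfig>
```

For other formats the parser attribute would name the matching Tika parser class, and the target fields must exist in schema.xml.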

Tod wrote:

On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
Hi,
I have exactly the same problem as the one you submitted at this link: http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html and I would like to ask whether you found a solution for it. I started to look at Tika and DataImportHandler, but I have not managed to find the right syntax. So can you please give an example if you succeeded in finding the right syntax?
Thanks.
Bumping this to the list...
Unfortunately I could never get DIH to work correctly. My suspicion is that I was using a stock 1.4.0 Solr but attempting to perform a task that was only available on the latest build. My customer requirements demand a pretty well vetted GA release so experimenting was not an option. I attempted an upgrade (quickly, sloppily) to 1.4.1 but no luck. I believe the next GA release might be my solution.
I tried getting around that bump by trying SolrJ ContentStreamUpdateRequest @ http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927. After floundering for a while I decided to put that on hold. I ended up writing a Perl script that emulates the command line cURL that I referenced in the above thread. It took about 72 hours to index ~850,000 entries (if anyone is interested).
I plan on looping back to try the suggestions Hoss last made, I just haven't had the time to respond. I'm sure things will work; I just needed something quickly and don't have the seasoned experience the other developers do.
- Tod
