Re: Data Import Handler Rich Format Documents

Dennis Gearon Fri, 24 Sep 2010 18:32:12 -0700

What's a GA release?

Dennis Gearon


Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/24/10, Lance Norskog <goks...@gmail.com> wrote:

> From: Lance Norskog <goks...@gmail.com>
> Subject: Re: Data Import Handler Rich Format Documents
> To: solr-user@lucene.apache.org
> Date: Friday, September 24, 2010, 6:19 PM
> The TikaEntityProcessor is the class
> in the DIH that calls the Tika libraries.
> TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in
> the trunk and the 3.x branch.
> 
> I have set it up from the 3.x branch. I discovered that the
> "DefaultParser" does not work, and you have to explicitly
> name the parser for the file format you want to use.
> 
> https://issues.apache.org/jira/browse/SOLR-2116
> 
> Tod wrote:
> > On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
> >> Hi,
> >> I have exactly the same problem as the one you submitted in this link
> >> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html
> >> and I would like to ask whether you found a solution for it.
> >> I started to look at Tika and DataImportHandler, but I have not
> >> managed to find the right syntax.
> >> So can you please give an example if you succeeded in finding the
> >> right syntax?
> >> Thanks.
> > 
> > Bumping this to the list...
> > 
> > Unfortunately I could never get DIH to work correctly.  My suspicion
> > is that I was using a stock 1.4.0 Solr but attempting to perform a
> > task that was only available on the latest build.  My customer
> > requirements demand a pretty well vetted GA release so experimenting
> > was not an option.  I attempted an upgrade (quickly, sloppily) to
> > 1.4.1 but no luck.  I believe the next GA release might be my solution.
> > 
> > I tried getting around that bump by trying SolrJ
> > ContentStreamUpdateRequest @
> > http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927.
> > After floundering for a while I decided to put that on hold.  I ended
> > up writing a Perl script that emulates the command line cURL that I
> > referenced in the above thread.  It took about 72 hours to index
> > ~850,000 entries (if anyone is interested).
> > 
> > I plan on looping back to try the suggestions Hoss last made, just
> > haven't had the time to respond.  I'm sure things will work; I just
> > needed something quickly and don't have the seasoned experience the
> > other developers do.
> > 
> > 
> > - Tod
>
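Since the original question asked for an example of the syntax, here is a rough sketch of a DIH config that names the Tika parser explicitly, as Lance describes above. It is written as a small Python helper that generates the data-config.xml text. The function name, the entity names ("files", "tika"), the directory, and the choice of PDFParser are illustrative assumptions, not something taken from this thread, and it only applies to a build that actually ships TikaEntityProcessor (trunk or 3.x, not 1.4.x).

```python
import xml.etree.ElementTree as ET


def build_tika_dih_config(base_dir, file_pattern, parser_class):
    """Return data-config.xml text that names the Tika parser explicitly.

    Hypothetical helper: all argument values are supplied by the caller.
    """
    root = ET.Element("dataConfig")
    # BinFileDataSource hands the raw file bytes to the entity processor.
    ET.SubElement(root, "dataSource", {"type": "BinFileDataSource", "name": "bin"})
    document = ET.SubElement(root, "document")
    # Outer entity: walk the directory and expose each file's absolute path.
    files = ET.SubElement(document, "entity", {
        "name": "files",
        "processor": "FileListEntityProcessor",
        "dataSource": "null",
        "baseDir": base_dir,
        "fileName": file_pattern,
        "rootEntity": "false",
    })
    # Inner entity: parse each file with the named parser class instead of
    # relying on the default parser.
    tika = ET.SubElement(files, "entity", {
        "name": "tika",
        "processor": "TikaEntityProcessor",
        "dataSource": "bin",
        "url": "${files.fileAbsolutePath}",
        "format": "text",
        "parser": parser_class,
    })
    # Map the extracted body text to a schema field (assumed to be "text").
    ET.SubElement(tika, "field", {"column": "text", "name": "text"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tika_dih_config("/data/docs", r".*\.pdf",
                                "org.apache.tika.parser.pdf.PDFParser"))
```

The printed XML would go into the data-config.xml referenced by the DataImportHandler request handler. Swap the parser class for the one matching your file format.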

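For anyone stuck on a 1.4.x GA release, the cURL-style workaround Tod mentions can be sketched along these lines. The cURL command itself is not quoted in this thread, so this assumes it posted each file to the extracting handler at /update/extract with a literal.id parameter. The function names and the base URL are hypothetical.

```python
import urllib.parse
import urllib.request


def build_extract_request(solr_url, doc_id, payload,
                          content_type="application/octet-stream"):
    """Build (but do not send) a POST to the extracting handler for one file."""
    # commit=false: committing once at the end of the run is usually much
    # cheaper than committing after every document.
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "false"})
    url = solr_url.rstrip("/") + "/update/extract?" + params
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": content_type}, method="POST")


def index_file(solr_url, doc_id, path):
    """Send one file to Solr and return the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_extract_request(solr_url, doc_id, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

For scale, 72 hours for ~850,000 entries works out to roughly 0.3 seconds per document, so per-request overhead such as committing on every post is worth checking first.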