Re: Data Import Handler Rich Format Documents

2010-09-29 Thread Chris Hostetter
: What's a GA release? http://en.wikipedia.org/wiki/Software_release_life_cycle#General_availability -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!

Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Dennis Gearon
What's a GA release? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/24/10, Lance Norskog wrote: > From: Lance Norskog > Subject: Re

Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Lance Norskog
The TikaEntityProcessor is the class in the DIH that calls the Tika libraries. TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in the trunk and the 3.x branch. I have set it up from the 3.x branch. I discovered that the "DefaultParser" does not work, and you have to explicitly name the
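For readers following the thread, here is a minimal sketch of the kind of data-config.xml being discussed: an outer entity supplies rows that contain a document URL, and a nested TikaEntityProcessor entity fetches and parses each document. It assumes a 3.x or trunk build where TikaEntityProcessor and BinURLDataSource are available. The JDBC driver, connection string, table, column names, and Solr field names are hypothetical placeholders, not taken from any message in this thread.

```xml
<dataConfig>
  <!-- Hypothetical metadata source: driver, url, user and password are placeholders. -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.example.Driver"
              url="jdbc:example://dbhost/metadata"
              user="solr" password="secret"/>
  <!-- Binary data source used to fetch the rich documents over HTTP. -->
  <dataSource name="bin" type="BinURLDataSource"/>

  <document>
    <!-- Outer entity: one row per document, including the URL of the file (assumed column "url"). -->
    <entity name="metadata" dataSource="db"
            query="SELECT id, title, url FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <!-- Nested entity: Tika extracts the body text of the file at ${metadata.url}.
           The optional "parser" attribute can name a specific Tika parser class;
           which class to use is not stated in the excerpt above, so it is omitted here. -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataSource="bin" url="${metadata.url}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

On Solr 1.4.x this configuration would be expected to fail to load, since neither class ships with that release.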

Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Tod
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote: Hi, I have exactly the same problem as the one you submitted in this link http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html and I would like to ask you if you got a solution for that. I started to have

Re: Data Import Handler Rich Format Documents

2010-07-06 Thread Tod
On 6/28/2010 8:28 AM, Alexey Serba wrote: Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource It

Re: Data Import Handler Rich Format Documents

2010-06-28 Thread Alexey Serba
> Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm using > Solr Version: 1.4.0 and getting the following error: > > java.lang.ClassNotFoundException: Unable to load BinURLDataSource or > org.apache.solr.handler.dataimport.BinURLDataSource It seems that DIH-Tika integration is

Re: Data Import Handler Rich Format Documents

2010-06-22 Thread Tod
On 6/18/2010 2:42 PM, Chris Hostetter wrote: : > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to

Re: Data Import Handler Rich Format Documents

2010-06-21 Thread Alexey Serba
You are right. It seems TikaEntityProcessor is exactly the tool you need in this case. Alex On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter wrote: > : I think you can use existing ExtractingRequestHandler to do the job, > : i.e. add child entity to your DIH metadata > > why would you do this in

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Chris Hostetter
: I think you can use existing ExtractingRequestHandler to do the job, : i.e. add child entity to your DIH metadata why would you do this instead of using the TikaEntityProcessor as i already suggested in my earlier mail? -Hoss

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Alexey Serba
I think you can use the existing ExtractingRequestHandler to do the job, i.e. add a child entity to your DIH metadata http://localhost:8983/solr/update/extract?extractOnly=true&wt=xml&indent=on&stream.url=${metadata.url}"; dataSource="solr"> That's not a working example, just basic

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
On 6/18/2010 2:42 PM, Chris Hostetter wrote: : > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Sixten Otto
On Fri, Jun 18, 2010 at 2:42 PM, Chris Hostetter wrote: > I'm confused ... You're using DIH, and some of your fields are URLs to > documents that you want to parse with Tika? > > Why would you need a custom Transformer? Yeah, I can definitely vouch that DIH can handle this without additional codi

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Chris Hostetter
: > I don't think DIH can do that, but who knows, let's see what others say. : Looks like the ExtractingRequestHandler uses Tika as well. I might just use : this but I'm wondering if there will be a large performance difference between : using it to batch content in over rolling my own Transform

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
On 6/18/2010 11:24 AM, Otis Gospodnetic wrote: Tod, I don't think DIH can do that, but who knows, let's see what others say. Yes, Nutch uses TIKA, too. Otis Looks like the ExtractingRequestHandler uses Tika as well. I might just use this but I'm wondering if there will be a large performan

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Otis Gospodnetic
> To: solr-user@lucene.apache.org > Sent: Fri, June 18, 2010 10:20:34 AM > Subject: Re: Data Import Handler Rich Format Documents > > On 6/18/2010 9:12 AM, Otis Gospodnetic wrote: > Tod, > > You didn't mention Tika, which makes me think you are not aware of it... > You

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
On 6/18/2010 9:12 AM, Otis Gospodnetic wrote: Tod, You didn't mention Tika, which makes me think you are not aware of it... You could implement a custom Transformer that uses Tika to perform rich doc text extraction, just like ExtractingRequestHandler does it (see http://wiki.apache.org/solr/E

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Otis Gospodnetic
Tod, You didn't mention Tika, which makes me think you are not aware of it... You could implement a custom Transformer that uses Tika to perform rich doc text extraction, just like ExtractingRequestHandler does it (see http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could even