Re: Data Import Handler Rich Format Documents

2010-09-29 Thread Chris Hostetter
: What's a GA release?
http://en.wikipedia.org/wiki/Software_release_life_cycle#General_availability
-Hoss
--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!

Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Dennis Gearon
> On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
> >> Hi,
> >> I have exactly the same problem as the one you submitted in this link
> >> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html
> >> and I would like to ask

Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Lance Norskog
y name the parser for the file format you want to use.
https://issues.apache.org/jira/browse/SOLR-2116
Tod wrote:
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
Hi,
I have exactly the same problem as the one you submitted in this link
http://lucene.472066.n3.nabble.com/Data-Import-Ha

Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Tod
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
Hi,
I have exactly the same problem as the one you submitted in this link
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html
and I would like to ask you if you got a solution for that. I started to

Re: Data Import Handler Rich Format Documents

2010-07-06 Thread Tod
On 6/28/2010 8:28 AM, Alexey Serba wrote:
Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error:
java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource
It

Re: Data Import Handler Rich Format Documents

2010-06-28 Thread Alexey Serba
> Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using
> Solr Version: 1.4.0 and getting the following error:
>
> java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
> org.apache.solr.handler.dataimport.BinURLDataSource
It seems that DIH-Tika integration is

Re: Data Import Handler Rich Format Documents

2010-06-22 Thread Tod
On 6/18/2010 2:42 PM, Chris Hostetter wrote:
: > I don't think DIH can do that, but who knows, let's see what others say.
: Looks like the ExtractingRequestHandler uses Tika as well. I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to

Re: Data Import Handler Rich Format Documents

2010-06-21 Thread Alexey Serba
You are right. It seems TikaEntityProcessor is exactly the tool you need in this case.
Alex
On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter wrote:
> : I think you can use existing ExtractingRequestHandler to do the job,
> : i.e. add child entity to your DIH metadata
>
> why would you do this in

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Chris Hostetter
: I think you can use existing ExtractingRequestHandler to do the job,
: i.e. add child entity to your DIH metadata
why would you do this instead of using the TikaEntityProcessor as i already suggested in my earlier mail?
-Hoss
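For readers following the thread, the TikaEntityProcessor approach suggested here could look roughly like the DIH configuration below. This is a minimal sketch, not a configuration posted by anyone in the thread. The JDBC driver, connection URL, credentials, the `documents` table, its `id`/`title`/`url` columns, and the `content` Solr field are all assumed names for illustration. It also assumes a Solr release whose DIH extras include TikaEntityProcessor and BinURLDataSource; the ClassNotFoundException reported elsewhere in this thread suggests Solr 1.4.0 does not.

```xml
<dataConfig>
  <!-- Hypothetical JDBC source for the CMS metadata; driver, url and credentials are placeholders. -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/cms"
              user="solr" password="secret"/>
  <!-- Binary data source used to fetch each published document by its URL. -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <document>
    <!-- Outer entity: one row of metadata per document (table and column names are assumed). -->
    <entity name="metadata" dataSource="db"
            query="SELECT id, title, url FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- Child entity: Tika parses the file behind the row's URL and returns its text. -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataSource="bin" url="${metadata.url}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested entity is what ties the two sources together: `${metadata.url}` resolves to the `url` column of the current parent row, so the metadata and the extracted body end up in the same Solr document.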

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Alexey Serba
I think you can use the existing ExtractingRequestHandler to do the job, i.e. add a child entity to your DIH metadata
http://localhost:8983/solr/update/extract?extractOnly=true&wt=xml&indent=on&stream.url=${metadata.url}" dataSource="solr">
That's not a working example, just basic

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
On 6/18/2010 2:42 PM, Chris Hostetter wrote:
: > I don't think DIH can do that, but who knows, let's see what others say.
: Looks like the ExtractingRequestHandler uses Tika as well. I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Sixten Otto
On Fri, Jun 18, 2010 at 2:42 PM, Chris Hostetter wrote:
> I'm confused ... You're using DIH, and some of your fields are URLs to
> documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?
Yeah, I can definitely vouch that DIH can handle this without additional codi

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Chris Hostetter
: > I don't think DIH can do that, but who knows, let's see what others say.
: Looks like the ExtractingRequestHandler uses Tika as well. I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transform

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
On 6/18/2010 11:24 AM, Otis Gospodnetic wrote:
> Tod,
> I don't think DIH can do that, but who knows, let's see what others say.
> Yes, Nutch uses TIKA, too.
> Otis
Looks like the ExtractingRequestHandler uses Tika as well. I might just use this but I'm wondering if there will be a large performan

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Otis Gospodnetic
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 10:20:34 AM
> Subject: Re: Data Import Handler Rich Format Documents
>
> On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:
> Tod,
>
> You didn't mention Tika, which makes me think you are not aware of it...
> You

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it.
- Original Message
From: Tod
To: solr-user@lucene.apache.org
Sent: Fri, June 18, 2010 8:51:02 AM
Subject: Data Import Handler Rich Format Document

Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Otis Gospodnetic
.org
> Sent: Fri, June 18, 2010 8:51:02 AM
> Subject: Data Import Handler Rich Format Documents
>
> I have a database containing Metadata from a content management system.
> Part of that data includes a URL pointing to the actual published document
> which can be an HTML fi

Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc. I'm already indexing the Metadata and that provides a lot of value. The custom