Re: Data Import Handler Rich Format Documents

Otis Gospodnetic Fri, 18 Jun 2010 08:24:54 -0700

Tod,

I don't think DIH can do that, but who knows, let's see what others say.
Yes, Nutch uses TIKA, too.


 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Tod <listac...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 10:20:34 AM
> Subject: Re: Data Import Handler Rich Format Documents
> 
> On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:
>> Tod,
>>
>> You didn't mention Tika, which makes me think you are not aware of it...
>> You could implement a custom Transformer that uses Tika to perform rich
>> doc text extraction, just like ExtractingRequestHandler does it (see
>> http://wiki.apache.org/solr/ExtractingRequestHandler ).  Maybe you could
>> even just call ERH from your Transformer, though that wouldn't be the
>> most efficient.
>
> You're right, sorry.  I have looked at Tika, which I believe is used by
> Nutch too - no?
>
> Implementing a transformer is fine.  I guess I'm being lazy and trying to
> see if a method of doing this has been incorporated into the latest Solr
> release so I can avoid coding for it.
>
>> ----- Original Message ----
>>> From: Tod <listac...@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Fri, June 18, 2010 8:51:02 AM
>>> Subject: Data Import Handler Rich Format Documents
>>>
>>> I have a database containing Metadata from a content management
>>> system.  Part of that data includes a URL pointing to the actual
>>> published document which can be an HTML file or a PDF, MS
>>> Word/Excel/Powerpoint, etc.
>>>
>>> I'm already indexing the Metadata and that provides a lot of value.
>>> The customer however would like that the content pointed to by the URL
>>> also be indexed for more discrete searching.
>>>
>>> This article at Lucid:
>>>
>>> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
>>>
>>> describes the process of coding a custom transformer.  A separate
>>> article I've read implies Nutch could be used to provide this
>>> functionality too.
>>>
>>> What would be the best and most efficient way to accomplish what I'm
>>> trying to do?  I have a feeling the Lucid article might be dated and
>>> there might be ways to do this now without any coding and maybe without
>>> even needing to use Nutch.  I'm using the current release version of
>>> Solr.
>>>
>>> Thanks in advance.
>>>
>>> - Tod
