RE: Tika: How can I import automatically all metadata without specifiying them explicitly

Gian Maria Ricci Mon, 27 May 2013 07:43:57 -0700

Thanks a lot for these further useful hints; standalone Tika could probably be 
a solution.


I have another small question: how can I express filters in the DIH 
configuration so that the import runs incrementally?

Actually I have two distinct scenarios.

In the first scenario the documents are stored in a database, so I need to 
write a DIH configuration to import the data from it; since I have a timestamp 
column, this is not a problem.
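
A minimal sketch of the database scenario, assuming a hypothetical `documents` 
table with `id`, `title`, `content` and a `last_modified` timestamp column (the 
JDBC driver, URL and credentials are placeholders as well). DIH exposes the 
time of the previous run as `${dataimporter.last_index_time}`, which the delta 
query can filter on:

```xml
<dataConfig>
  <!-- Connection details are placeholders, not a real server. -->
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=docs"
              user="solr" password="secret" />
  <document>
    <!-- query: used by full-import.
         deltaQuery: finds the ids of rows changed since the last run.
         deltaImportQuery: loads each changed row by id. -->
    <entity name="doc" pk="id"
            query="SELECT id, title, content FROM documents"
            deltaQuery="SELECT id FROM documents
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, content FROM documents
                              WHERE id = '${dataimporter.delta.id}'" />
  </document>
</dataConfig>
```

Running the handler with `command=delta-import` would then pick up only the 
rows changed since the previous import.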

Second scenario: I need to monitor a folder and do an incremental import every 
15 minutes. With a SQL DIH I usually use a column as a filter for incremental 
imports, but I wonder whether it is possible to pass a filter to 
BinFileDataSource, telling it to process only new files and those modified 
after a timestamp (the last run).
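
One option that may fit the folder scenario, offered as an assumption to verify 
rather than a confirmed answer: the filter would sit on FileListEntityProcessor 
(which enumerates the files) rather than on BinFileDataSource, through its 
`newerThan` attribute. The sketch below reuses the folder and field names from 
the configuration quoted further down; the `fileName` pattern is simplified for 
illustration.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- newerThan limits the listing to files modified after the given
         date; here it is resolved from the time of the previous import. -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/temp/docs"
            fileName=".*\.(doc|pdf|docx)"
            newerThan="${dataimporter.last_index_time}"
            onError="skip"
            recursive="true">
      <field column="file" name="id" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport"
              processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}"
              format="text">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because this is not a delta import, it would presumably be run as 
`command=full-import&clean=false` so that the existing index is not wiped. 
Files deleted from the folder would not be removed from the index by this 
approach.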

Thanks again for all your valuable suggestions.

--
Gian Maria Ricci
Mobile: +39 320 0136949
    


-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Monday, May 27, 2013 1:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without 
specifiying them explicitly

Standalone Tika can also run in network server mode. That adds data round 
trips but gives you more options, even from .NET.

Regards,
      Alex
On 27 May 2013 04:22, "Gian Maria Ricci" <alkamp...@nablasoft.com> wrote:

> Thanks for the help.
>
> @Alexandre: Thanks for the suggestion, I'll try to use an 
> ExtractingRequestHandler, I thought that I was missing some DIH option :).
>
> @Erick: I'm interested in knowing them all in order to do various forms of 
> analysis. I have documents coming from heterogeneous sources and I'm 
> interested in searching inside the content, but also in being able to 
> extract all possible metadata. I'm working in .NET, so it is useful 
> to let Tika do everything for me directly in Solr and then 
> retrieve all metadata for matched documents.
>
> Thanks again to everyone.
>
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, May 26, 2013 5:30 PM
> To: solr-user@lucene.apache.org; Gian Maria Ricci
> Subject: Re: Tika: How can I import automatically all metadata without 
> specifiying them explicitly
>
> In addition to Alexandre's comment:
>
> bq:  ...I'd like to import in my index all metadata
>
> Be a little careful here, this isn't actually very useful in my
> experience. Sure it's nice to have all that data in the index, but...
> how do you search it meaningfully?
>
> Consider that some doc may have an "author" metadata field. Another 
> may have a "last editor" field. Yet another may have a "main author" 
> field. If you add all these under their own field names, what do you do 
> to search for "author"?
> Somehow you have to create a mapping between the various metadata 
> names and something that's searchable, so why not do this at index time?
>
> Not to mention I've seen this done and the result may be literally 
> hundreds of different metadata fields which are not very useful.
>
> All that said, it may be perfectly valid to index them all, but before 
> going there it's worth considering whether the result is actually _useful_.
>
> Best
> Erick
>
>
> On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
> <alkamp...@nablasoft.com> wrote:
>
> > Hi to everyone,
> >
> > I've configured the import of a document folder with
> > FileListEntityProcessor, and everything went smoothly on the first try,
> > but I have a simple question. I'm able to map metadata without any
> > problem, but I'd like to import all metadata into my index, not only
> > the fields I've configured with field nodes. In this example I've
> > imported Author and title, but I do not know in advance which
> > metadata a document could have, and I wish to have all of it in my
> > index.
> >
> > Here is my import config. It is my first try at importing with
> > Tika and I'm probably missing something simple.
> >
> > <dataConfig>
> >   <dataSource type="BinFileDataSource" />
> >   <document>
> >     <entity name="files" dataSource="null" rootEntity="false"
> >             processor="FileListEntityProcessor"
> >             baseDir="c:/temp/docs"
> >             fileName=".*\.(doc)|(pdf)|(docx)"
> >             onError="skip"
> >             recursive="true">
> >       <field column="file" name="id" />
> >       <field column="fileAbsolutePath" name="path" />
> >       <field column="fileSize" name="size" />
> >       <field column="fileLastModified" name="lastModified" />
> >       <entity name="documentImport"
> >               processor="TikaEntityProcessor"
> >               url="${files.fileAbsolutePath}"
> >               format="text">
> >         <field column="file" name="fileName"/>
> >         <field column="Author" name="author" meta="true"/>
> >         <field column="title" name="title" meta="true"/>
> >         <field column="text" name="text"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > --
> > Gian Maria Ricci
> > Mobile: +39 320 0136949
> >
>
