Re: Tika: How can I import automatically all metadata without specifying them explicitly

Erick Erickson Sun, 26 May 2013 08:36:13 -0700

In addition to Alexandre's comment:

bq:  ...I’d like to import in my index all metadata


Be a little careful here; this isn't actually very useful in my experience.
Sure, it's nice to have all that data in the index, but... how do you search
it meaningfully?

Consider that some doc may have an "author" metadata field. Another may have
a "last editor" field. Yet another may have a "main author" field. If you
add all of these under their own field names, what do you do to search for
"author"? Somehow you have to create a mapping between the various metadata
names and something that's searchable, so why not do this at index time?
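
To make the index-time mapping concrete, here is a minimal sketch in plain
Python. It is not Solr or Tika API: the alias table and the function name
normalize_metadata are hypothetical, and the raw names are just the examples
above, not a real list of Tika metadata keys. Whether "last editor" should
count as an author at all is exactly the kind of decision you have to make.

```python
# Hypothetical alias table: raw metadata name (lowercased) -> canonical field.
# The raw names are the examples from this message, not real Tika keys.
FIELD_ALIASES = {
    "author": "author",
    "last editor": "author",
    "main author": "author",
}


def normalize_metadata(raw):
    """Collapse variant metadata names into canonical multi-valued fields."""
    doc = {}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key.strip().lower())
        if canonical is None:
            # Unmapped metadata is dropped instead of being indexed blindly.
            continue
        values = doc.setdefault(canonical, [])
        if value not in values:
            values.append(value)
    return doc
```

Calling normalize_metadata({"Author": "Ann", "Last Editor": "Bob", "Pages": "3"})
yields a single "author" field holding both names, and the unmapped "Pages"
entry is left out, so a search on "author" only ever has one field to consult.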

Not to mention I've seen this done, and the result may be literally hundreds
of different metadata fields which are not very useful.

All that said, it may be perfectly valid to index them all, but before going
there it's worth considering whether the result is actually _useful_.

Best
Erick


On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
<alkamp...@nablasoft.com> wrote:

> Hi everyone,
>
> I’ve configured the import of a document folder with FileListEntityProcessor,
> and everything went smoothly on the first try, but I have a simple question.
> I’m able to map metadata without any problem, but I’d like to import all
> metadata into my index, not only the fields I’ve configured with field nodes.
> In this example I’ve imported Author and title, but I do not know in advance
> which metadata a document could have, and I wish to have all of them inside
> my index.
>
> Here is my import config. It is my first try at importing with Tika, and
> I’m probably missing something simple.
>
> <dataConfig>
>   <dataSource type="BinFileDataSource" />
>   <document>
>     <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/temp/docs" fileName=".*\.(doc)|(pdf)|(docx)"
>             onError="skip"
>             recursive="true">
>       <field column="file" name="id" />
>       <field column="fileAbsolutePath" name="path" />
>       <field column="fileSize" name="size" />
>       <field column="fileLastModified" name="lastModified" />
>
>       <entity name="documentImport"
>               processor="TikaEntityProcessor"
>               url="${files.fileAbsolutePath}"
>               format="text">
>         <field column="file" name="fileName"/>
>         <field column="Author" name="author" meta="true"/>
>         <field column="title" name="title" meta="true"/>
>         <field column="text" name="text"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>
