RE: Tika: How can I import automatically all metadata without specifiying them explicitly

Alexandre Rafalovitch Mon, 27 May 2013 04:45:17 -0700

Standalone Tika can also run in network server mode. That adds data
round trips but gives you more options, even from .NET.
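As a rough illustration of that round trip (a sketch under assumptions, not a tested recipe): it assumes a Tika server listening on the default localhost:9998 and a build whose /meta endpoint returns JSON when asked through the Accept header. The helper names build_meta_request, parse_metadata and fetch_metadata are made up for this example, and Python is used only for brevity; the same HTTP PUT can be issued from .NET.

```python
import json
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_META_URL = "http://localhost:9998/meta"


def build_meta_request(data: bytes, url: str = TIKA_META_URL) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for a document's metadata as JSON."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def parse_metadata(body: bytes) -> dict:
    """Decode the JSON response body into a dict of metadata name -> value."""
    return json.loads(body.decode("utf-8"))


def fetch_metadata(path: str, url: str = TIKA_META_URL) -> dict:
    """Send one file to the Tika server and return whatever metadata it reports."""
    with open(path, "rb") as fh:
        request = build_meta_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return parse_metadata(response.read())
```

The client then decides which of the returned names to index, at the cost of one extra hop per document.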


Regards,
      Alex
On 27 May 2013 04:22, "Gian Maria Ricci" <alkamp...@nablasoft.com> wrote:

> Thanks for the help.
>
> @Alexandre: Thanks for the suggestion. I'll try to use an
> ExtractingRequestHandler; I thought that I was missing some DIH option :).
>
> @Erick: I'm interested in knowing them all so I can do various forms of
> analysis. I have documents coming from heterogeneous sources, and I'm
> interested in searching inside the content but also in extracting all
> possible metadata. I'm working in .NET, so it is useful to let Tika do
> everything for me directly in Solr and then retrieve all metadata for
> matched documents.
>
> Thanks again to everyone.
>
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, May 26, 2013 5:30 PM
> To: solr-user@lucene.apache.org; Gian Maria Ricci
> Subject: Re: Tika: How can I import automatically all metadata without
> specifiying them explicitly
>
> In addition to Alexandre's comment:
>
> bq:  ...I'd like to import in my index all metadata
>
> Be a little careful here; this isn't actually very useful in my experience.
> Sure, it's nice to have all that data in the index, but... how do you search
> it meaningfully?
>
> Consider that some doc may have an "author" metadata field. Another may
> have a "last editor" field. Yet another may have a "main author" field. If
> you add all of these under their own field names, what do you do to search
> for "author"? Somehow you have to create a mapping between the various
> metadata names and something that's searchable, so why not do this at index
> time?
>
> Not to mention I've seen this done and the result may be literally hundreds
> of different metadata fields which are not very useful.
>
> All that said, it may be perfectly valid to index them all, but before going
> there it's worth considering whether the result is actually _useful_.
>
> Best
> Erick
>
>
> On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
> <alkamp...@nablasoft.com> wrote:
>
> > Hi to everyone,
> >
> > I've configured import of a document folder with
> > FileListEntityProcessor, and everything went smoothly on the first try,
> > but I have a simple question. I'm able to map metadata without any
> > problem, but I'd like to import all metadata into my index, not only
> > those I've configured with field nodes. In this example I've imported
> > Author and title, but I do not know in advance which metadata a
> > document could have, and I wish to have all of them inside my
> > index.
> >
> > Here is my import config. It is my first try at importing with Tika,
> > and I'm probably missing something simple.
> >
> > <dataConfig>
> >   <dataSource type="BinFileDataSource" />
> >   <document>
> >     <entity name="files" dataSource="null" rootEntity="false"
> >             processor="FileListEntityProcessor"
> >             baseDir="c:/temp/docs"
> >             fileName=".*\.(doc)|(pdf)|(docx)"
> >             onError="skip"
> >             recursive="true">
> >       <field column="file" name="id" />
> >       <field column="fileAbsolutePath" name="path" />
> >       <field column="fileSize" name="size" />
> >       <field column="fileLastModified" name="lastModified" />
> >
> >       <entity name="documentImport"
> >               processor="TikaEntityProcessor"
> >               url="${files.fileAbsolutePath}"
> >               format="text">
> >         <field column="file" name="fileName"/>
> >         <field column="Author" name="author" meta="true"/>
> >         <field column="title" name="title" meta="true"/>
> >         <field column="text" name="text"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > --
> > Gian Maria Ricci
> > Mobile: +39 320 0136949
> >
> >
>
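For the ExtractingRequestHandler route mentioned in this thread, combined with the index-time mapping Erick describes, a minimal sketch could look like the following. It relies on the handler's literal.id, uprefix and fmap.<source> request parameters. Everything else is an assumption for illustration: the /update/extract path, a schema that defines an attr_* dynamic field and an author field, the alias names passed in, and the helper names build_extract_url and post_document.

```python
import urllib.request
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, author_aliases, unknown_prefix="attr_"):
    """Build an extract URL that keeps unknown metadata and folds aliases into 'author'."""
    params = [
        ("literal.id", doc_id),       # supply the unique key explicitly
        ("uprefix", unknown_prefix),  # prefix for fields the schema does not define
    ]
    # Map each alias (hypothetical names supplied by the caller) onto one searchable field.
    params += [(f"fmap.{alias}", "author") for alias in author_aliases]
    return f"{solr_base}/update/extract?{urlencode(params)}"


def post_document(path, url):
    """POST the raw file to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            url,
            data=fh.read(),
            method="POST",
            headers={"Content-Type": "application/octet-stream"},
        )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With this shape, metadata that has no matching schema field still lands in the index under the prefix, while the names you care about are normalized up front. A commit still has to be issued separately before the documents become searchable.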
