Thanks a lot for the other useful hints; standalone Tika could probably be a solution.

I have another small question: how can I express filters in the DIH configuration so that the import on the server runs incrementally? I actually have two distinct scenarios.

In the first scenario the documents are stored inside a database, so I need to write a DIH config to import data from the database; since I have a timestamp column, this is not a problem.

In the second scenario I need to monitor one folder and do an incremental population every 15 minutes. Usually with a SQL DIH I use some column as a filter for the incremental population, but I wonder whether it is possible to pass a filter to BinFileDataSource, telling it to process only new files and those modified after a timestamp (the last run).

Thanks again for all your precious suggestions.

--
Gian Maria Ricci
Mobile: +39 320 0136949

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Monday, May 27, 2013 1:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without specifiying them explicitly

Standalone Tika can also run in a network server mode. That increases data roundtrips but gives you more options, even in .NET.

Regards,
Alex

On 27 May 2013 04:22, "Gian Maria Ricci" <alkamp...@nablasoft.com> wrote:

> Thanks for the help.
>
> @Alexandre: Thanks for the suggestion, I'll try to use an
> ExtractingRequestHandler. I thought that I was missing some DIH
> option :).
>
> @Erick: I'm interested in knowing them all to do various forms of
> analysis. I have documents coming from heterogeneous sources and I'm
> interested in searching inside the content, but also in being able to
> extract all possible metadata. I'm working in .NET, so it is useful to
> let Tika do everything for me directly in Solr and then retrieve all
> metadata for matched documents.
>
> Thanks again to everyone.
>
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, May 26, 2013 5:30 PM
> To: solr-user@lucene.apache.org; Gian Maria Ricci
> Subject: Re: Tika: How can I import automatically all metadata without
> specifiying them explicitly
>
> In addition to Alexandre's comment:
>
> bq: ...I'd like to import in my index all metadata
>
> Be a little careful here, this isn't actually very useful in my
> experience. Sure, it's nice to have all that data in the index, but...
> how do you search it meaningfully?
>
> Consider that some doc may have an "author" metadata field. Another
> may have a "last editor" field. Yet another may have a "main author"
> field. If you add all these under their own field names, what do you
> do to search for "author"? Somehow you have to create a mapping
> between the various metadata names and something that's searchable,
> so why not do this at index time?
>
> Not to mention I've seen this done, and the result may be literally
> hundreds of different metadata fields which are not very useful.
>
> All that said, it may be perfectly valid to index them all, but before
> going there it's worth considering whether the result is actually
> _useful_.
>
> Best
> Erick
>
> On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
> <alkamp...@nablasoft.com> wrote:
>
> > Hi to everyone,
> >
> > I've configured the import of a document folder with
> > FileListEntityProcessor, and everything went smoothly on the first
> > try, but I have a simple question. I'm able to map metadata without
> > any problem, but I'd like to import all metadata into my index, not
> > only those I've configured with field nodes. In this example I've
> > imported Author and title, but I do not know in advance which
> > metadata a document could have, and I wish to have all of them
> > inside my index.
> >
> > Here is my import config.
> > It is the first try with importing with Tika, and I'm probably
> > missing something simple.
> >
> > <dataConfig>
> >   <dataSource type="BinFileDataSource" />
> >   <document>
> >     <entity name="files"
> >             dataSource="null" rootEntity="false"
> >             processor="FileListEntityProcessor"
> >             baseDir="c:/temp/docs"
> >             fileName=".*\.(doc)|(pdf)|(docx)"
> >             onError="skip"
> >             recursive="true">
> >       <field column="file" name="id" />
> >       <field column="fileAbsolutePath" name="path" />
> >       <field column="fileSize" name="size" />
> >       <field column="fileLastModified" name="lastModified" />
> >
> >       <entity name="documentImport"
> >               processor="TikaEntityProcessor"
> >               url="${files.fileAbsolutePath}"
> >               format="text">
> >         <field column="file" name="fileName"/>
> >         <field column="Author" name="author" meta="true"/>
> >         <field column="title" name="title" meta="true"/>
> >         <field column="text" name="text"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > --
> > Gian Maria Ricci
> > Mobile: +39 320 0136949
> >
> > <http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635>
> > <http://www.linkedin.com/in/gianmariaricci>
> > <https://twitter.com/alkampfer>
> > <http://feeds.feedburner.com/AlkampferEng>
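
A note on the folder scenario raised at the top of this message. As far as I know, BinFileDataSource itself takes no filter, but the outer FileListEntityProcessor entity accepts a newerThan attribute, which can be a date (yyyy-MM-dd HH:mm:ss), a date math string such as 'NOW-3DAYS', or a variable. The fragment below is only a sketch, untested against this setup: it reuses the entity from the config quoted above and assumes that DIH's ${dataimporter.last_index_time} variable (the time of the previous import, kept in dataimport.properties) is available on this installation.

```xml
<!-- Sketch only: the single change from the original entity is newerThan. -->
<entity name="files"
        dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="c:/temp/docs"
        fileName=".*\.(doc)|(pdf)|(docx)"
        newerThan="${dataimporter.last_index_time}"
        onError="skip"
        recursive="true">
  <!-- The field mappings and the nested TikaEntityProcessor entity
       stay exactly as in the original config. -->
</entity>
```

With this approach the import would have to be triggered every 15 minutes by an external scheduler, since DIH does not schedule itself, using command=full-import together with clean=false so that documents already in the index are not wiped. Files deleted from the folder would not be detected this way and would need separate handling.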