I fixed the problem using the dataimport request handler:

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>

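For the /dataimport handler above to load, the DataImportHandler jars (and the Tika extraction libraries used by TikaEntityProcessor) must be on the core's classpath. A minimal sketch of the solrconfig.xml entries, assuming a stock Solr 5.x layout; the relative default path is an assumption and depends on where the core lives:

```xml
<!-- Assumed layout: the jars ship under dist/ and contrib/extraction/lib of the
     Solr install directory. Adjust the fallback path after the colon if the
     solr.install.dir property is not set for your instance. -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
```

If these entries are missing, the core typically fails to load with a class-not-found error for DataImportHandler.
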
I configured tika-data-config.xml according to my needs to get the right values:

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            dataSource="null" rootEntity="false"
            baseDir="D:\Lucene\document"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <field column="file" name="fileName" />

Now there is no need to index from the command line with SimplePostTool; just go to the dataimport page of the web admin and execute a full import.

2015-12-04 17:05 GMT+00:00 kostali hassan <med.has.kost...@gmail.com>:

> Thank you. That's why I chose to add the exact value using the Solarium PHP
> client, but the timeout stops indexing after 30 seconds:
>
> $dir = new Folder($dossier);
> $files = $dir->find('.*\.*');
> foreach ($files as $file) {
>     $file = new File($dir->pwd() . DS . $file);
>
>     $query = $client->createExtract();
>     $query->setFile($file->pwd());
>     $query->setCommit(true);
>     $query->setOmitHeader(false);
>
>     $doc = $query->createDocument();
>     $doc->id = $file->pwd();
>     $doc->name = $file->name;
>     $doc->title = $file->name();
>
>     $query->setDocument($doc);
>
> 2015-12-04 16:50 GMT+00:00 Erik Hatcher <erik.hatc...@gmail.com>:
>
>> Kostali -
>>
>> See if the "Introspect rich document parsing and extraction" section of
>> http://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/
>> helps*. You’ll be able to see the output of /update/extract (aka Tika) and
>> adjust your mappings and configurations accordingly.
>>
>> * And apologies that bin/post isn’t Windows savvy at this point, but
>> you’ve got the hang of the Windows-compatible command-line it looks like.
>>
>> --
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>> > On Dec 4, 2015, at 11:44 AM, kostali hassan <med.has.kost...@gmail.com>
>> > wrote:
>> >
>> > Thank you Erick, I followed your advice and took a look at the Apache
>> > Tika configuration. I have modified my /update/extract request handler:
>> >
>> > <requestHandler name="/update/extract"
>> >                 startup="lazy"
>> >                 class="solr.extraction.ExtractingRequestHandler">
>> >   <lst name="defaults">
>> >     <str name="fmap.Last-Modified">last_modified</str>
>> >     <str name="uprefix">ignored_</str>
>> >
>> >     <!-- capture link hrefs but ignore div attributes -->
>> >     <str name="captureAttr">true</str>
>> >     <str name="fmap.a">links</str>
>> >     <str name="fmap.div">ignored_</str>
>> >   </lst>
>> >   <str name="tika.config">D:\solr\solr-5.3.1\server\solr\tika-data-config.xml</str>
>> > </requestHandler>
>> >
>> > and the Tika config:
>> >
>> > <dataConfig>
>> >   <dataSource type="BinFileDataSource" />
>> >   <document>
>> >     <entity name="files" processor="FileListEntityProcessor"
>> >             dataSource="null" rootEntity="false"
>> >             baseDir="D:\Lucene\document"
>> >             fileName=".*.(doc)|(pdf)|(docx)"
>> >             onError="skip"
>> >             recursive="true">
>> >       <field column="fileAbsolutePath" name="lux_uri" />
>> >       <field column="fileSize" name="size" />
>> >       <field column="fileLastModified" name="lastModified" />
>> >
>> >       <entity name="documentImport"
>> >               processor="TikaEntityProcessor"
>> >               url="${files.fileAbsolutePath}"
>> >               format="text">
>> >         <field column="file" name="fileName" meta="true"/>
>> >         <field column="Author" name="author" meta="true"/>
>> >         <field column="name" name="name" meta="true"/>
>> >         <field column="title" name="title" meta="true"/>
>> >         <field column="text" name="text"/>
>> >         <field column="custom:Testmeta" name="Testmeta" meta="true"/>
>> >         <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
>> >       </entity>
>> >     </entity>
>> >   </document>
>> > </dataConfig>
>> >
>> > and schema.xml:
>> >
>> > <field name="Testmeta" type="text" indexed="true" stored="true" />
>> >
>> > but the problem is the same: the title of the indexed files is wrong
>> > for MS Word documents.