I fixed the problem by using the dataimport request handler in solrconfig.xml:

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
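
For this handler to load, the DataImportHandler and Tika jars normally have to be on the core's classpath. A minimal sketch of the matching solrconfig.xml entries follows; the directory layout is an assumption based on a stock Solr 5.x install, so the paths may need adjusting:

```xml
<!-- Assumed paths: relative to a default Solr 5.x installation -->
<!-- DataImportHandler itself, plus the "extras" jar that holds TikaEntityProcessor -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Tika and its dependencies, shipped with the extraction contrib -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
```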

I configured tika-data-config.xml according to my needs so that the right
values are extracted:
<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="files" processor="FileListEntityProcessor"
                dataSource="null" rootEntity="false"
                baseDir="D:\Lucene\document"
                fileName=".*\.(DOC|PDF|pdf|doc|docx|ppt)$"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />
            <field column="file" name="fileName" />

Now there is no need to index from the command line with SimplePostTool: just
go to the Dataimport page in the web admin and run a full import.
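
The configuration quoted above is cut off after the file-level fields. A complete file might look like the sketch below. The inner TikaEntityProcessor entity and its field names are assumptions carried over from the earlier config in this thread, not something confirmed for the final setup. The fileName pattern groups all extensions in a single alternation, which is also an assumption about the intent (only files ending in one of those extensions should be picked up).

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- Outer entity: walks baseDir and emits one row per matching file -->
    <entity name="files" processor="FileListEntityProcessor"
            dataSource="null" rootEntity="false"
            baseDir="D:\Lucene\document"
            fileName=".*\.(DOC|PDF|pdf|doc|docx|ppt)$"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <field column="file" name="fileName" />

      <!-- Inner entity (assumed, taken from the earlier config):
           runs Tika on each file found by the outer entity -->
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Since the outer entity has rootEntity="false", each Solr document is built from the inner entity's row, with the file-level fields (id, size, lastModified, fileName) merged in from the outer one.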

2015-12-04 17:05 GMT+00:00 kostali hassan <med.has.kost...@gmail.com>:

> Thank you. That is why I chose to add the exact values using the Solarium
> PHP client, but a timeout stops the indexing after 30 seconds:
>
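> $dir = new Folder($dossier);
> $files = $dir->find('.*\.*');
> foreach ($files as $file) {
>     $file = new File($dir->pwd() . DS . $file);
>
>     // Build an extract (Solr Cell) request for this file
>     $query = $client->createExtract();
>     $query->setFile($file->pwd());
>     $query->setCommit(true);
>     $query->setOmitHeader(false);
>
>     // Attach the literal field values to the extracted document
>     $doc = $query->createDocument();
>     $doc->id = $file->pwd();
>     $doc->name = $file->name;
>     $doc->title = $file->name();
>
>     $query->setDocument($doc);
>
>     // Assumed: the original snippet stops here; the request is
>     // presumably sent like this
>     $client->extract($query);
> }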
>
> 2015-12-04 16:50 GMT+00:00 Erik Hatcher <erik.hatc...@gmail.com>:
>
>> Kostali -
>>
>> See if the "Introspect rich document parsing and extraction" section of
>> http://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/
>> helps*.  You’ll be able to see the output of /update/extract (aka Tika) and
>> adjust your mappings and configurations accordingly.
>>
>> * And apologies that bin/post isn’t Windows savvy at this point, but
>> you’ve got the hang of the Windows-compatible command-line it looks like.
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>>
>>
>> > On Dec 4, 2015, at 11:44 AM, kostali hassan <med.has.kost...@gmail.com> wrote:
>> >
>> > Thank you Erick, I followed your advice and took a look at the Apache
>> > Tika configuration. I have modified my /update/extract request handler:
>> >
>> > <requestHandler name="/update/extract"
>> >                  startup="lazy"
>> >                  class="solr.extraction.ExtractingRequestHandler" >
>> >    <lst name="defaults">
>> >      <str name="fmap.Last-Modified">last_modified</str>
>> >      <str name="uprefix">ignored_</str>
>> >
>> >      <!-- capture link hrefs but ignore div attributes -->
>> >      <str name="captureAttr">true</str>
>> >      <str name="fmap.a">links</str>
>> >      <str name="fmap.div">ignored_</str>
>> >    </lst>
>> >    <str name="tika.config">D:\solr\solr-5.3.1\server\solr\tika-data-config.xml</str>
>> >  </requestHandler>
>> >
>> > and the Tika config:
>> >
>> > <dataConfig>
>> >    <dataSource type="BinFileDataSource" />
>> >    <document>
>> >        <entity name="files" processor="FileListEntityProcessor"
>> >                dataSource="null" rootEntity="false"
>> >                baseDir="D:\Lucene\document"
>> >                fileName=".*\.(doc|pdf|docx)$"
>> >                onError="skip"
>> >                recursive="true">
>> >                <field column="fileAbsolutePath" name="lux_uri" />
>> >                <field column="fileSize" name="size" />
>> >                <field column="fileLastModified" name="lastModified" />
>> >
>> >               <entity
>> >                    name="documentImport"
>> >                    processor="TikaEntityProcessor"
>> >                    url="${files.fileAbsolutePath}"
>> >                    format="text">
>> >                    <field column="file" name="fileName" meta="true"/>
>> >                    <field column="Author" name="author" meta="true"/>
>> >                    <field column="name" name="name" meta="true"/>
>> >                    <field column="title" name="title" meta="true"/>
>> >                    <field column="text" name="text"/>
>> >                    <field column="custom:Testmeta" name="Testmeta" meta="true"/>
>> >                    <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
>> >                </entity>
>> >        </entity>
>> >    </document>
>> > </dataConfig>
>> >
>> > and schema.xml:
>> >
>> > <field name="Testmeta" type="text" indexed="true" stored="true" />
>> >
>> >
>> >
>> > But the problem is the same: the title of the indexed files is still
>> > wrong for MS Word documents.
>>
>>
>
