Thank you. That is why I chose to set the exact values using the Solarium
PHP client, but indexing stops with a timeout after 30 seconds:

$dir = new Folder($dossier);
$files = $dir->find('.*\.*');
foreach ($files as $file) {
    $file = new File($dir->pwd() . DS . $file);

    // Send the file to /update/extract (Tika) through Solarium
    $query = $client->createExtract();
    $query->setFile($file->pwd());
    $query->setCommit(true);
    $query->setOmitHeader(false);

    // Set the exact field values instead of relying on Tika's metadata
    $doc = $query->createDocument();
    $doc->id = $file->pwd();
    $doc->name = $file->name;
    $doc->title = $file->name();

    $query->setDocument($doc);
    $client->extract($query);
}
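
One way to approach the timeout can be sketched as follows, assuming the
30-second cutoff is PHP's default max_execution_time rather than a Solarium
or Solr limit (not confirmed here). The helper name indexFiles and the $paths
list are hypothetical; $client is assumed to be an already configured
Solarium client. The sketch lifts the script time limit, skips the per-file
commit, and commits once at the end:

```php
<?php
// Sketch only: $client is assumed to be a configured Solarium client and
// $paths a list of absolute file paths (both set up elsewhere).
// indexFiles() is a hypothetical helper, not part of Solarium.

/**
 * Index each file through /update/extract, then commit once at the end.
 * Returns the paths that raised an exception so they can be retried.
 */
function indexFiles($client, array $paths)
{
    // Assumption: the 30-second stop is PHP's max_execution_time default.
    set_time_limit(0);

    $failed = array();
    foreach ($paths as $path) {
        $query = $client->createExtract();
        $query->setFile($path);
        $query->setCommit(false);      // no commit per file
        $query->setOmitHeader(false);

        // Set id and title explicitly instead of trusting Tika's metadata
        $doc = $query->createDocument();
        $doc->id = $path;
        $doc->title = pathinfo($path, PATHINFO_FILENAME);
        $query->setDocument($doc);

        try {
            $client->extract($query);
        } catch (Exception $e) {
            $failed[] = $path;         // keep going when one file fails
        }
    }

    // Single commit for the whole batch
    $update = $client->createUpdate();
    $update->addCommit();
    $client->update($update);

    return $failed;
}
```

Called as $failed = indexFiles($client, $paths); it returns the paths that
could not be indexed, so they can be logged or retried.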

2015-12-04 16:50 GMT+00:00 Erik Hatcher <erik.hatc...@gmail.com>:

> Kostali -
>
> See if the "Introspect rich document parsing and extraction" section of
> http://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/
> helps*.  You’ll be able to see the output of /update/extract (aka Tika) and
> adjust your mappings and configurations accordingly.
>
> * And apologies that bin/post isn’t Windows savvy at this point, but
> you’ve got the hang of the Windows-compatible command-line it looks like.
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
> > On Dec 4, 2015, at 11:44 AM, kostali hassan <med.has.kost...@gmail.com> wrote:
> >
> > Thank you Erik, I followed your advice and took a look at the Apache Tika
> > config. I have modified my /update/extract request handler:
> >
> > <requestHandler name="/update/extract"
> >                  startup="lazy"
> >                  class="solr.extraction.ExtractingRequestHandler" >
> >    <lst name="defaults">
> >      <str name="fmap.Last-Modified">last_modified</str>
> >      <str name="uprefix">ignored_</str>
> >
> >      <!-- capture link hrefs but ignore div attributes -->
> >      <str name="captureAttr">true</str>
> >      <str name="fmap.a">links</str>
> >      <str name="fmap.div">ignored_</str>
> >    </lst>
> >    <str name="tika.config">D:\solr\solr-5.3.1\server\solr\tika-data-config.xml</str>
> >  </requestHandler>
> >
> > and the Tika config:
> >
> > <dataConfig>
> >    <dataSource type="BinFileDataSource" />
> >    <document>
> >        <entity name="files" processor="FileListEntityProcessor"
> >                dataSource="null" rootEntity="false"
> >                baseDir="D:\Lucene\document"
> >                fileName=".*\.(doc|pdf|docx)"
> >                onError="skip"
> >                recursive="true">
> >                <field column="fileAbsolutePath" name="lux_uri" />
> >                <field column="fileSize" name="size" />
> >                <field column="fileLastModified" name="lastModified" />
> >
> >               <entity
> >                    name="documentImport"
> >                    processor="TikaEntityProcessor"
> >                    url="${files.fileAbsolutePath}"
> >                    format="text">
> >                    <field column="file" name="fileName" meta="true"/>
> >                    <field column="Author" name="author" meta="true"/>
> >                    <field column="name" name="name" meta="true"/>
> >                    <field column="title" name="title" meta="true"/>
> >                    <field column="text" name="text"/>
> >                    <field column="custom:Testmeta" name="Testmeta" meta="true"/>
> >                    <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
> >                </entity>
> >        </entity>
> >    </document>
> > </dataConfig>
> >
> > and schema.xml:
> >
> > <field name="Testmeta" type="text" indexed="true" stored="true" />
> >
> >
> >
> > but the problem is the same: the title of the indexed files is wrong
> > for MS Word documents.
>
>
