Re: indexing rich data with solr 5.3.1 integrating in Ubuntu server

kostali hassan Tue, 26 Jan 2016 09:03:34 -0800

They are loaded, because Solr indexes .doc and .docx (MS Word) files; it
fails only for PDF files.


2016-01-26 12:49 GMT+00:00 Emir Arnautovic <emir.arnauto...@sematext.com>:

> Hi,
> I would first check whether the external libraries are present and loaded.
> How do you start Solr? Try explicitly setting solr.install.dir, or set an
> absolute path to the libs, and check the logs to see whether they are loaded.
>
> <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib"
> regex=".*\.jar" />
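>
> A minimal sketch of the absolute-path variant, assuming Solr is unpacked
> under /opt/solr (a hypothetical location, not taken from this thread;
> substitute your own install directory):
>
> ```xml
> <!-- /opt/solr is an assumed install path; adjust it to your server -->
> <lib dir="/opt/solr/contrib/extraction/lib" regex=".*\.jar" />
> <lib dir="/opt/solr/dist/" regex="solr-cell-\d.*\.jar" />
> ```
>
> With absolute paths, the result no longer depends on how
> solr.install.dir is resolved when Solr starts.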
>
>
> Thanks,
> Emir
>
> On 25.01.2016 15:16, kostali hassan wrote:
>
>> http://stackoverflow.com/questions/34962280/solr-indexing-pdf-attachments-not-working-in-ubuntu#
>>
>>
>> I have a problem integrating Solr on an Ubuntu server. Before deploying
>> to the Ubuntu server I tested Solr on my Mac, where it worked perfectly
>> with both the DIH request handler and update/extract: it indexed my
>> PDF, DOC, and DOCX documents. After installing Solr on the Ubuntu server
>> with the same configuration files and libraries, I found that Solr does
>> not index PDF documents, and there are no errors or exceptions in the
>> Solr log. I can, however, search over the .doc and .docx documents.
>>
>> Here are the relevant parts of my solrconfig.xml:
>>
>> <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
>> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
>>
>> <requestHandler name="/update/extract"
>>                 startup="lazy"
>>                 class="solr.extraction.ExtractingRequestHandler">
>>   <lst name="defaults">
>>     <str name="lowernames">true</str>
>>     <str name="fmap.meta">ignored_</str>
>>     <str name="fmap.content">_text_</str>
>>   </lst>
>> </requestHandler>
>>
>> DIH config:
>>
>> <requestHandler name="/dataimport"
>>                 class="org.apache.solr.handler.dataimport.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="config">tika.config.xml</str>
>>   </lst>
>> </requestHandler>
>>
>> tika.config.xml
>>
>> <dataConfig>
>>   <dataSource type="BinFileDataSource" />
>>   <document>
>>     <entity name="files" processor="FileListEntityProcessor"
>>             dataSource="null" rootEntity="false"
>>             baseDir="D:\Lucene\document"
>>             fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)"
>>             onError="skip"
>>             recursive="true">
>>       <field column="fileAbsolutePath" name="id" />
>>       <field column="fileSize" name="size" />
>>       <field column="fileLastModified" name="lastModified" />
>>       <field column="file" name="title" />
>>       <entity name="documentImport"
>>               dataSource="files"
>>               processor="TikaEntityProcessor"
>>               url="${files.fileAbsolutePath}"
>>               format="text">
>>         <field column="Author" name="author" meta="true"/>
>>         <field column="title" name="title" meta="true"/>
>>         <field column="text" name="text"/>
>>         <field column="text" name="content"/>
>>         <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
>>       </entity>
>>     </entity>
>>   </document>
>> </dataConfig>
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
