What about "recursive=true"? Do you have subdirectories that could
make a difference. Your SimplePostTool would not look at
subdirectories (great comparison, BTW).
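
A quick sanity check (hypothetical paths, plain Windows commands) is
to compare the recursive and flat file counts:

    dir /s /b c:\Users\gt\Documents\epub\*.epub
    dir /b c:\Users\gt\Documents\epub\*.epub

If most of the 58 live below the top level, then post.jar never
actually tested those files.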

That said, you do have lots of mapping options with the
/update/extract handler as well; have a look at the examples and
documentation.
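
For example (an untested sketch; the field names are assumed from
your schema), Tika metadata can be mapped per-request:

    java -Dtype=application/epub+zip "-Dparams=fmap.Author=author"
        -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar ...

or as defaults on the handler in solrconfig.xml:

    <requestHandler name="/update/extract"
        class="solr.extraction.ExtractingRequestHandler" startup="lazy">
        <lst name="defaults">
            <str name="fmap.content">content</str>
            <str name="fmap.Author">author</str>
            <str name="lowernames">true</str>
            <str name="uprefix">ignored_</str>
        </lst>
    </requestHandler>

fmap.* renames a Tika field, and uprefix catches any metadata field
your schema doesn't know about (it assumes an ignored_* dynamic field).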

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 12:24, Gary Taylor <g...@inovem.com> wrote:
> Alex,
>
> Thanks for the suggestions.  It always indexes just one doc, regardless of
> which epub file it sees first.  Debug / verbose don't show anything obvious
> to me.  I can include the output here if you think it would help.
>
> I tried using the SimplePostTool first (java -Dtype=application/epub+zip
> -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar
> \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika
> parsing, and that works OK, so I don't think it's the epubs.
>
> I was trying to use DIH so that I could more easily specify the schema
> fields and store content in the index, in preparation for trying out
> search highlighting.  I couldn't work out how to do that with post.jar ....
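>
> (For the highlighting test itself I was planning something along the
> lines of:
>
>     http://localhost:8983/solr/hn2/select?q=text:something&hl=true&hl.fl=content
>
> which is why the content field needs to be stored.)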
>
> Thanks,
> Gary
>
>
> On 25/02/2015 17:09, Alexandre Rafalovitch wrote:
>>
>> Try removing that first epub from the directory and rerunning. If you
>> now index 0 documents, then there is something unexpected about the
>> remaining files and DIH skips them. If it indexes 1 document again,
>> but a different one, then it is definitely something about the repeat
>> logic.
>>
>> Also, try running with debug and verbose modes and see if something
>> specific shows up.
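>>
>> For example (assuming the hn2 core from your first mail), something
>> like:
>>
>>     http://localhost:8983/solr/hn2/dataimport?command=full-import&debug=true&verbose=true
>>
>> should return per-entity detail in the response rather than just the
>> summary counts.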
>>
>> Regards,
>>     Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 25 February 2015 at 11:14, Gary Taylor <g...@inovem.com> wrote:
>>>
>>> I can't get the FileListEntityProcessor and TikaEntityProcessor to
>>> correctly add a Solr document for each epub file in my local directory.
>>>
>>> I've just downloaded Solr 5.0.0, on a Windows 7 PC.  I ran "solr start"
>>> and then "solr create -c hn2" to create a new core.
>>>
>>> I want to index a load of epub files that I've got in a directory. So I
>>> created a data-import.xml (in solr\hn2\conf):
>>>
>>> <dataConfig>
>>>     <dataSource type="BinFileDataSource" name="bin" />
>>>     <document>
>>>         <entity name="files" dataSource="null" rootEntity="false"
>>>                 processor="FileListEntityProcessor"
>>>                 baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>>>                 onError="skip"
>>>                 recursive="true">
>>>             <field column="fileAbsolutePath" name="id" />
>>>             <field column="fileSize" name="size" />
>>>             <field column="fileLastModified" name="lastModified" />
>>>
>>>             <entity name="documentImport" processor="TikaEntityProcessor"
>>>                     url="${files.fileAbsolutePath}" format="text"
>>>                     dataSource="bin" onError="skip">
>>>                 <field column="file" name="fileName"/>
>>>                 <field column="Author" name="author" meta="true"/>
>>>                 <field column="title" name="title" meta="true"/>
>>>                 <field column="text" name="content"/>
>>>             </entity>
>>>         </entity>
>>>     </document>
>>> </dataConfig>
>>>
>>> In my solrconfig.xml, I added a requestHandler entry to reference my
>>> data-import.xml:
>>>
>>>     <requestHandler name="/dataimport"
>>>                     class="org.apache.solr.handler.dataimport.DataImportHandler">
>>>         <lst name="defaults">
>>>             <str name="config">data-import.xml</str>
>>>         </lst>
>>>     </requestHandler>
>>>
>>> I renamed managed-schema to schema.xml, and ensured the following doc
>>> fields were set up:
>>>
>>>        <field name="id" type="string" indexed="true" stored="true"
>>> required="true" multiValued="false" />
>>>        <field name="fileName" type="string" indexed="true" stored="true"
>>> />
>>>        <field name="author" type="string" indexed="true" stored="true" />
>>>        <field name="title" type="string" indexed="true" stored="true" />
>>>
>>>        <field name="size" type="long" indexed="true" stored="true" />
>>>        <field name="lastModified" type="date" indexed="true"
>>> stored="true" />
>>>
>>>        <field name="content" type="text_en" indexed="false" stored="true"
>>> multiValued="false"/>
>>>        <field name="text" type="text_en" indexed="true" stored="false"
>>> multiValued="true"/>
>>>
>>>      <copyField source="content" dest="text"/>
>>>
>>> I copied all the jars from dist and contrib\* into server\solr\lib.
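>>>
>>> (I understand the alternative is to load them via lib directives in
>>> solrconfig.xml, e.g.
>>>
>>>     <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
>>>     <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
>>>
>>> with the paths adjusted for the core's location, but copying to
>>> server\solr\lib seemed simpler.)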
>>>
>>> Stopping and restarting Solr then creates a new managed-schema file and
>>> renames schema.xml to schema.xml.back.
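>>>
>>> (Presumably that's the managed-schema default at work; I believe setting
>>>
>>>     <schemaFactory class="ClassicIndexSchemaFactory"/>
>>>
>>> in solrconfig.xml would keep schema.xml as-is, but the rename doesn't
>>> seem to break anything.)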
>>>
>>> All good so far.
>>>
>>> Now I go to the web admin page for dataimport
>>> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to
>>> execute a full import.
>>>
>>> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1",
>>> i.e. it only adds one document (the very first one) even though it
>>> iterated over 58!
>>>
>>> No errors are reported in the logs.
>>>
>>> I can search on the contents of that first epub document, so it's
>>> extracting OK in Tika, but there's a problem somewhere in my config
>>> that's causing only one document to be indexed in Solr.
>>>
>>> Thanks for any assistance / pointers.
>>>
>>> Regards,
>>> Gary
>>>
>>> --
>>> Gary Taylor | www.inovem.com | www.kahootz.com
>>>
>>> INOVEM Ltd is registered in England and Wales No 4228932
>>> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
>>> kahootz.com is a trading name of INOVEM Ltd.
>>>
>
> --
> Gary Taylor | www.inovem.com | www.kahootz.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
> kahootz.com is a trading name of INOVEM Ltd.
>
