Alex,
Thanks for the suggestions. It always indexes just 1 doc, whichever
epub file it sees first. Debug / verbose don't show anything
obvious to me. I can include the output here if you think it would help.
I tried using the SimplePostTool first (java
-Dtype=application/epub+zip
-Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar
\Users\gt\Documents\epub\*.epub) to index the docs and check the Tika
parsing, and that works OK, so I don't think it's the epubs.
I was trying to use DIH so that I could more easily specify the schema
fields and store content in the index in preparation for trying out the
search highlighting. Couldn't work out how to do that with post.jar ....
Thanks,
Gary
On 25/02/2015 17:09, Alexandre Rafalovitch wrote:
Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about them
and DIH skips. If it indexes 1 document again but a different one,
then it is definitely something about the repeat logic.
Also, try running with debug and verbose modes and see if something
specific shows up.
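For reference, a minimal sketch of such a debug run against the DIH endpoint. It assumes the core is named hn2 as in the original message below and that curl is available; neither assumption comes from this reply. The command, debug and verbose request parameters are standard DIH parameters.

```shell
# Assumed values: host/port and core name (hn2) are taken from the original
# message, not from this reply; adjust them to the actual setup.
SOLR_CORE_URL="http://localhost:8983/solr/hn2"

# command=full-import starts a full import; debug=true and verbose=true ask
# DIH to report per-entity details in the response.
DIH_URL="${SOLR_CORE_URL}/dataimport?command=full-import&debug=true&verbose=true"

echo "Requesting: ${DIH_URL}"

# Fetch the response; print a short note instead of failing if Solr is not
# reachable (or curl is not installed).
curl -s --max-time 10 "${DIH_URL}" || echo "Could not reach Solr at ${SOLR_CORE_URL}"
```

In debug mode DIH does not commit the imported documents on its own, so commit=true would need to be appended to the URL if the results should be searchable afterwards.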
Regards,
Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
On 25 February 2015 at 11:14, Gary Taylor <g...@inovem.com> wrote:
I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
add a Solr document for each epub file in my local directory.
I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and
then "solr create -c hn2" to create a new core.
I want to index a load of epub files that I've got in a directory. So I
created a data-import.xml (in solr\hn2\conf):
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
In my solrconfig.xml, I added a requestHandler entry to reference my
data-import.xml:
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>
I renamed managed-schema to schema.xml, and ensured the following doc fields
were set up:
<field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="long" indexed="true" stored="true" />
<field name="lastModified" type="date" indexed="true" stored="true" />
<field name="content" type="text_en" indexed="false" stored="true"
multiValued="false"/>
<field name="text" type="text_en" indexed="true" stored="false"
multiValued="true"/>
<copyField source="content" dest="text"/>
I copied all the jars from dist and contrib\* into server\solr\lib.
Stopping and restarting Solr then creates a new managed-schema file and
renames schema.xml to schema.xml.bak.
All good so far.
Now I go to the web admin for dataimport
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to
execute a full import.
But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1",
i.e. it only adds one document (the very first one) even though it has
iterated over 58!
No errors are reported in the logs.
I can search on the contents of that first epub document, so it's extracting
OK in Tika, but there's a problem somewhere in my config that's causing only
1 document to be indexed in Solr.
Thanks for any assistance / pointers.
Regards,
Gary
--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.