What about "recursive=true"? Do you have subdirectories that could make a difference. Your SimplePostTool would not look at subdirectories (great comparison, BTW).
However, you do also have lots of mapping options with the /update/extract
handler; look at the example and the documentation. There is a lot of
mapping available there (see the P.S. at the very bottom for a rough
sketch).

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 12:24, Gary Taylor <g...@inovem.com> wrote:
> Alex,
>
> Thanks for the suggestions. It always just indexes 1 doc, regardless of
> which epub file it sees first. Debug / verbose modes don't show anything
> obvious to me. I can include the output here if you think it would help.
>
> I tried using the SimplePostTool first (java -Dtype=application/epub+zip
> -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar
> \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika
> parsing, and that works OK, so I don't think it's the epubs.
>
> I was trying to use DIH so that I could more easily specify the schema
> fields and store content in the index in preparation for trying out the
> search highlighting. I couldn't work out how to do that with post.jar.
>
> Thanks,
> Gary
>
>
> On 25/02/2015 17:09, Alexandre Rafalovitch wrote:
>>
>> Try removing that first epub from the directory and rerunning. If you
>> now index 0 documents, then there is something unexpected about them
>> and DIH skips them. If it indexes 1 document again but a different
>> one, then it is definitely something about the repeat logic.
>>
>> Also, try running with debug and verbose modes and see if something
>> specific shows up.
>>
>> Regards,
>>    Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 25 February 2015 at 11:14, Gary Taylor <g...@inovem.com> wrote:
>>>
>>> I can't get the FileListEntityProcessor and TikaEntityProcessor to
>>> correctly add a Solr document for each epub file in my local
>>> directory.
>>>
>>> I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr
>>> start" and then "solr create -c hn2" to create a new core.
>>>
>>> I want to index a load of epub files that I've got in a directory.
>>> So I created a data-import.xml (in solr\hn2\conf):
>>>
>>> <dataConfig>
>>>     <dataSource type="BinFileDataSource" name="bin" />
>>>     <document>
>>>         <entity name="files" dataSource="null" rootEntity="false"
>>>                 processor="FileListEntityProcessor"
>>>                 baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>>>                 onError="skip" recursive="true">
>>>             <field column="fileAbsolutePath" name="id" />
>>>             <field column="fileSize" name="size" />
>>>             <field column="fileLastModified" name="lastModified" />
>>>
>>>             <entity name="documentImport"
>>>                     processor="TikaEntityProcessor"
>>>                     url="${files.fileAbsolutePath}" format="text"
>>>                     dataSource="bin" onError="skip">
>>>                 <field column="file" name="fileName"/>
>>>                 <field column="Author" name="author" meta="true"/>
>>>                 <field column="title" name="title" meta="true"/>
>>>                 <field column="text" name="content"/>
>>>             </entity>
>>>         </entity>
>>>     </document>
>>> </dataConfig>
>>>
>>> In my solrconfig.xml, I added a requestHandler entry to reference my
>>> data-import.xml:
>>>
>>> <requestHandler name="/dataimport"
>>>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>>>     <lst name="defaults">
>>>         <str name="config">data-import.xml</str>
>>>     </lst>
>>> </requestHandler>
>>>
>>> I renamed managed-schema to schema.xml, and ensured the following doc
>>> fields were set up:
>>>
>>> <field name="id" type="string" indexed="true" stored="true"
>>>        required="true" multiValued="false" />
>>> <field name="fileName" type="string" indexed="true" stored="true" />
>>> <field name="author" type="string" indexed="true" stored="true" />
>>> <field name="title" type="string" indexed="true" stored="true" />
>>> <field name="size" type="long" indexed="true" stored="true" />
>>> <field name="lastModified" type="date" indexed="true" stored="true" />
>>> <field name="content" type="text_en" indexed="false" stored="true"
>>>        multiValued="false"/>
>>> <field name="text" type="text_en" indexed="true" stored="false"
>>>        multiValued="true"/>
>>>
>>> <copyField source="content" dest="text"/>
>>>
>>> I copied all the jars from dist and contrib\* into server\solr\lib.
>>>
>>> Stopping and restarting Solr then creates a new managed-schema file
>>> and renames schema.xml to schema.xml.back.
>>>
>>> All good so far.
>>>
>>> Now I go to the web admin for dataimport
>>> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to
>>> execute a full import.
>>>
>>> But the results show "Requests: 0, Fetched: 58, Skipped: 0,
>>> Processed: 1" - i.e. it only adds one document (the very first one)
>>> even though it has iterated over 58!
>>>
>>> No errors are reported in the logs.
>>>
>>> I can search on the contents of that first epub document, so it's
>>> extracting OK in Tika, but there's a problem somewhere in my config
>>> that's causing only 1 document to be indexed in Solr.
>>>
>>> Thanks for any assistance / pointers.
>>>
>>> Regards,
>>> Gary
>>>
>>> --
>>> Gary Taylor | www.inovem.com | www.kahootz.com
>>>
>>> INOVEM Ltd is registered in England and Wales No 4228932
>>> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
>>> kahootz.com is a trading name of INOVEM Ltd.
>>>
>
> --
> Gary Taylor | www.inovem.com | www.kahootz.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
> kahootz.com is a trading name of INOVEM Ltd.
>
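P.S. On the /update/extract mapping I mentioned above: off the top of my
head, something along these lines should map Tika output into your schema
fields (an untested sketch - "book1" and the file name are placeholders,
the fmap source names depend on what metadata Tika actually emits for
epub, and uprefix assumes you have an ignored_* dynamic field in the
schema to swallow the leftovers):

  curl "http://localhost:8983/solr/hn2/update/extract?literal.id=book1&commit=true&fmap.content=content&fmap.Author=author&fmap.title=title&uprefix=ignored_" -F "file=@book.epub"

literal.* sets a field to a fixed value, fmap.* renames an extracted field
to a schema field, and uprefix catches anything you haven't mapped
explicitly - so you get much the same control over fields as with DIH.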