Does anybody have an idea? Solr seems to ignore the DTD definition and therefore
does not understand entities like &uuml; or &auml; that are defined in
the DTD. Is that the problem? If so, how can I tell Solr to consider the DTD
definition?
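One possible workaround that sidesteps the DTD question, sketched here as an untested assumption rather than a confirmed fix for Solr, would be to rewrite the named entities as numeric character references before running the import. Numeric references need no declaration. The helper name to_numeric_refs is made up for this sketch. The entity table is Python's HTML 4 list, which is only assumed to cover the names used in dblp.xml.

```python
import re
from html.entities import name2codepoint

# The entities predefined by XML itself must stay untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &uuml; (numeric ones like &#252; are not matched).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(text):
    """Replace known named entities (e.g. &uuml;) with numeric references (e.g. &#252;)."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave predefined and unknown entities exactly as they are.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    print(to_numeric_refs("<title>Equations.&uuml; &amp; &auml;</title>"))
```

For a huge file the same function could be applied line by line while copying the file, so the whole document never has to be held in memory. Entities that the DTD defines but the HTML table lacks would still be left unresolved.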
On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki<v...@belenki.name>
wrote:
Dear community,
I am experiencing a strange problem while trying to import an XML
document into Solr via the DataImportHandler. The XML document contains some
special characters (e.g. the German ü) that are represented as XML entities
such as &uuml; or &auml;. There is also a DTD file that defines these entities
(<!ENTITY uuml "ü">). I tried referencing the DTD file as well as
including the DTD definition in the XML itself. After I start the
full-import command, the import process throws an exception as soon as
it tries to parse &uuml;: Undeclared general entity "uuml". Has anyone
already faced such a problem?
Best regards,
Michael
My data-config for importing is:
<dataConfig>
<dataSource type="FileDataSource" encoding="ISO-8859-1" />
<document>
<!-- stream should be true since a huge XML document is being parsed -->
<entity name="article"
processor="XPathEntityProcessor"
stream="true"
forEach="/dblp/article"
url="documents/dblp.xml"
>
<field column="key" xpath="/dblp/article/@key" />
<field column="title" xpath="/dblp/article/title" />
</entity>
</document>
</dataConfig>
The XML file looks like this, for example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY uuml "ü"><!-- small u, dieresis or umlaut mark -->
]>
<dblp>
<article key="journals/fm/Riccardi09" mdate="2011-10-27">
<author>Marco Riccardi</author>
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
<pages>117-122</pages>
<year>2009</year>
<volume>17</volume>
<journal>Formalized Mathematics</journal>
<number>1-4</number>
<ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee>
<url>db/journals/fm/fm17.html#Riccardi09</url>
</article>
</dblp>
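As a side check of the XML rule involved (not of Solr itself), the sketch below parses two tiny documents modelled on the sample with Python's standard-library parser. It is only an illustration under stated assumptions: Solr uses the Woodstox parser that appears in the stack trace, not Python's expat, and the key k1 and the helper parse_title are made up for this example. It shows that an entity declared in the internal DTD subset resolves, while the same reference without a declaration is rejected as undefined.

```python
import xml.etree.ElementTree as ET

# Same shape as the dblp.xml sample: the entity is declared in the internal DTD subset.
WITH_DTD = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE dblp [\n'
    '<!ENTITY uuml "&#252;">\n'
    ']>\n'
    '<dblp><article key="k1"><title>Equations.&uuml;</title></article></dblp>'
)

# Identical document, but without the DOCTYPE block.
WITHOUT_DTD = (
    '<?xml version="1.0"?>\n'
    '<dblp><article key="k1"><title>Equations.&uuml;</title></article></dblp>'
)


def parse_title(xml_text):
    """Return the text of the first article title, or the parser's error message."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return "ParseError: " + str(err)
    return root.find("./article/title").text


if __name__ == "__main__":
    print(parse_title(WITH_DTD))     # entity is declared, so it is expanded
    print(parse_title(WITHOUT_DTD))  # entity is not declared, so parsing fails
```

If the parser used by the import behaves like the second case even though the DOCTYPE is present, that would suggest the DTD is being skipped rather than read.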
The stack trace is:
05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
05.07.2012 17:37:19 org.apache.solr.common.SolrException log
SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    ... 5 more
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
 at [row,col {unknown-source}]: [26,42]
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor$2.run(XPathEntityProcessor.java:427)
Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
 at [row,col {unknown-source}]: [26,42]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:467)
    at com.ctc.wstx.sr.BasicStreamReader.handleUndeclaredEntity(BasicStreamReader.java:5431)
    at com.ctc.wstx.sr.StreamScanner.expandUnresolvedEntity(StreamScanner.java:1661)
    at com.ctc.wstx.sr.StreamScanner.expandEntity(StreamScanner.java:1555)
    at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1523)
    at com.ctc.wstx.sr.BasicStreamReader.skipTokenText(BasicStreamReader.java:3568)
    at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3342)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2622)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:376)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$200(XPathRecordReader.java:202)
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:184)
    ... 1 more
05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback