Update on the matter: I edited pom.xml and changed the xerces version,
which was set to 2.7.1, to 2.9.1, 2.11.0, 2.8.0 and other versions.
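
For anyone repeating this, it may be worth confirming that the version
change in pom.xml actually reaches the runtime classpath. Below is a minimal
sketch, not part of mwdumper: the class name ParserCheck and the helper
describe() are made up for illustration, and only standard JDK calls are
used. It prints which SAX parser implementation JAXP selects and where that
class was loaded from.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic class; it is not part of mwdumper.
public class ParserCheck {

    /** Returns the class name plus the location (jar or directory) it was loaded from. */
    static String describe(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader may report no code source at all.
        String origin = (src == null || src.getLocation() == null)
                ? "(no code source, probably the JDK itself)"
                : src.getLocation().toString();
        return cls.getName() + " from " + origin;
    }

    public static void main(String[] args) throws Exception {
        // JAXP picks the implementation from the classpath or system properties.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        System.out.println("factory: " + describe(factory.getClass()));
        System.out.println("parser:  " + describe(parser.getClass()));
    }
}
```

Run it with the same classpath that mwdumper runs under. If the printed
location is not the xerces jar for the version set in pom.xml, the build is
probably still picking up a different copy of the library.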

The out-of-bounds error differs on later versions, but it still
persists.
Also, I tried to use mwdumper with an older Wikipedia dump:
20130102.

This time the error appears on the first file:
enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z

Should I file a new bug in Bugzilla for mwdumper?
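
To help decide where such a report belongs, a standalone parse might narrow
things down. The following is a rough sketch under assumptions: SaxCount and
count() are hypothetical names, nothing here comes from the mwdumper source,
and it only counts <page> and <revision> elements with a plain JAXP SAX
parser. If the same ArrayIndexOutOfBoundsException shows up here with xerces
on the classpath, the problem is more likely in the parser than in mwdumper.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical test harness; it is not part of mwdumper.
public class SaxCount {

    /** Parses the stream and returns {pages, revisions}. */
    static long[] count(InputStream in) throws Exception {
        final long[] counts = new long[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if ("page".equals(qName)) {
                    counts[0]++;
                } else if ("revision".equals(qName)) {
                    counts[1]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Reads the decompressed dump from standard input.
        long[] c = count(System.in);
        System.out.println(c[0] + " pages, " + c[1] + " revs");
    }
}
```

The dump would need to be decompressed and piped in on standard input, with
the xerces jar under test on the classpath.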

Michael


On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis
<[email protected]> wrote:

> Great! At least we know what's causing it. I've seen the thread about
> xerces before, but it was so old that I assumed it was unrelated.
>
> Let me know when there is a new build to try out or anything else I can do
> to help fix the problem.
>
> Michael
>
>
> On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn <[email protected]> wrote:
>
>> On 20-05-2013, Mon, at 13:18 +0200, Michael Tsikerdekis wrote:
>>
>> > 33 pages (0.593/sec), 25,374 revs (455.695/sec)
>> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
>> >         at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
>> >         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>> >         at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
>> >         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
>> >         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>> >         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>> >         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>> >         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>> >         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>> >         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>> >         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>> ...
>>
>> The file itself is fine; proof of that is that I isolated the
>> problematic page, removed the first revision (which had been processed
>> without problems) and then all remaining revisions including the 'bad'
>> one were handled properly.
>>
>> This is most likely a regression:
>> http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
>> Our spec says to build against maven's xerces version 2.7.1, and I
>> expect that never got the patch [1].  I'm not sure what version of the
>> xerces library is good ([2]).
>>
>> I'm adding Chad back on the cc though since he'll have to update the
>> build specs.  Chad, do you want a bugzilla report for this?
>>
>> Ariel
>>
>> [1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
>> [2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697205#action_12697205
>>
>>
>>
>>
>> _______________________________________________
>> MediaWiki-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>>
>
>
