If you can stomach it, I would report it upstream, linking to the earlier
version of the bug they had with a proposed patch etc.  I can give them
a test file consisting of the one page with all its revisions, "only"
170 MB uncompressed :-D

It's also fine to open a local report against mwdumper and link it to the
upstream report.

Thanks,

Ariel

On Tue, 21-05-2013 at 15:57 +0200, Michael Tsikerdekis wrote:
> Update on the matter: I edited pom.xml and changed the xerces version from
> 2.7.1 to 2.9.1, 2.11.0, 2.8.0 and other versions.
> 
> The out-of-bounds error changes on later versions, but it still
> persists.
> Also, I tried mwdumper with an older Wikipedia dump:
> 20130102.
> 
> The error still appears, this time on the first
> file: enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z
> 
> Should I report a new bug on bugzilla for mwdumper?
> 
> Michael
> 
> 
> On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis
> <[email protected]> wrote:
> 
> > Great! At least we know what's causing it. I had seen the thread about
> > xerces before, but it was so old that I assumed it was unrelated.
> >
> > Let me know when there is a new build to try out or anything else I can do
> > to help fix the problem.
> >
> > Michael
> >
> >
> > On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn <[email protected]> wrote:
> >
> >> On Mon, 20-05-2013 at 13:18 +0200, Michael Tsikerdekis wrote:
> >>
> >> > 33 pages (0.593/sec), 25,374 revs (455.695/sec)
> >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
> >> >         at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
> >> >         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> >> >         at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> >> >         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> >> >         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> >> >         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> >> >         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >> >         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >> >         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >> >         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> >> >         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> >> ...
> >>
> >> The file itself is fine; proof of that is that I isolated the
> >> problematic page, removed the first revision (which had been processed
> >> without problems) and then all remaining revisions including the 'bad'
> >> one were handled properly.
> >>
> >> This is most likely a regression:
> >> http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> >> Our spec says to build against maven's xerces version 2.7.1, and I
> >> expect that never got the patch [1].  I'm not sure what version of the
> >> xerces library is good ([2]).
> >>
> >> I'm adding Chad back on the cc though since he'll have to update the
> >> build specs.  Chad, do you want a bugzilla report for this?
> >>
> >> Ariel
> >>
> >> [1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> >> [2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697205#action_12697205
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> MediaWiki-l mailing list
> >> [email protected]
> >> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
> >>
> >
> >



