If you can stomach it, I would report it upstream, linking to the earlier version of the bug they had, with a proposed patch etc. I can give them a test file consisting of the one page with all its revisions, "only" 170 MB uncompressed :-D
It's fine to open a report locally too in mwdumper and link the upstream report.

Thanks,

Ariel

On Tue, 21-05-2013 at 15:57 +0200, Michael Tsikerdekis wrote:
> Update on the matter. I've edited pom.xml and changed the xerces version, which
> was set to 2.7.1, to 2.9.1, 2.11.0, 2.8.0 and other versions.
>
> The out-of-bounds error changes on later versions, but the
> error persists.
> Also, I tried to use mwdumper with an older version of the wikipedia dump:
> 20130102.
>
> The error still appears, on the first file this
> time: enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z
>
> Should I report a new bug on bugzilla for mwdumper?
>
> Michael
>
>
> On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis
> <[email protected]> wrote:
>
> > great! at least we know what's causing it. I've seen the thread about
> > xerces before but it was too old so I thought there is probably no relation.
> >
> > Let me know when there is a new build to try out or anything else I can do
> > to help fix the problem.
> >
> > Michael
> >
> >
> > On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn <[email protected]> wrote:
> >
> >> On Mon, 20-05-2013 at 13:18 +0200, Michael Tsikerdekis wrote:
> >>
> >> > 33 pages (0.593/sec), 25,374 revs (455.695/sec)
> >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
> >> >     at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
> >> >     at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> >> >     at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> >> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> >> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> >> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> >> >     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >> >     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >> >     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >> >     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> >> >     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> >> ...
> >>
> >> The file itself is fine; proof of that is that I isolated the
> >> problematic page, removed the first revision (which had been processed
> >> without problems) and then all remaining revisions including the 'bad'
> >> one were handled properly.
> >>
> >> This is most likely a regression:
> >> http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> >> Our spec says to build against maven's xerces version 2.7.1, and I
> >> expect that never got the patch [1]. I'm not sure what version of the
> >> xerces library is good ([2]).
> >>
> >> I'm adding Chad back on the cc though since he'll have to update the
> >> build specs. Chad, do you want a bugzilla report for this?
> >>
> >> Ariel
> >>
> >> [1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> >> [2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697205#action_12697205
> >>
> >> _______________________________________________
> >> MediaWiki-l mailing list
> >> [email protected]
> >> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
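
Since the error persisted after the xerces version in pom.xml was changed, it may be worth confirming which SAX parser implementation is actually loaded at runtime. The following is a minimal sketch, not part of mwdumper: the class name `ParserCheck` and the method `describeParser` are made up for illustration, and it assumes the check is run with the same classpath as the mwdumper build under test. It uses only standard JAXP calls.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper (not part of mwdumper): reports which SAX parser
// implementation JAXP resolves, and where that class was loaded from.
public class ParserCheck {

    public static String describeParser() {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            Class<?> impl = parser.getClass();
            // Classes from the JDK's built-in parser may have no code source.
            CodeSource src = impl.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "unknown location (possibly the JDK's built-in parser)"
                    : src.getLocation().toString();
            return impl.getName() + " from " + location;
        } catch (Exception e) {
            // Wrap checked JAXP exceptions so callers need no throws clause.
            throw new IllegalStateException("could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeParser());
    }
}
```

If the reported class lives under `org.apache.xerces`, as in the stack trace above, the location should show which xerces jar was picked up; a class under `com.sun.org.apache.xerces.internal` would instead indicate the parser bundled with the JDK.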
