Thanks, Ariel. One small thing: where exactly can I report it upstream? Got a URL?

Michael

On Tue, May 21, 2013 at 5:45 PM, Ariel T. Glenn <[email protected]> wrote:

> If you can stomach it I would report it upstream, linking to the earlier
> version of the bug they had with a proposed patch etc. I can give them
> a test file consisting of the one page with all its revisions, "only"
> 170 mb uncompressed :-D
>
> It's fine to open a report locally too in mwdumper and link the upstream
> report.
>
> Thanks,
>
> Ariel
>
> On 21-05-2013, Tue, at 15:57 +0200, Michael Tsikerdekis wrote:
> > Update on the matter. I've edited pom.xml and changed xerces version which
> > was set to 2.7.1 to 2.9.1, 2.11.0, 2.8.0 and other versions.
> >
> > The out of bound error becomes different on later versions but still the
> > error persists.
> > Also, I tried to use mwdumper with an older version of wikipedia dump:
> > 20130102.
> >
> > The error still appears on the first file this time:
> > enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z
> >
> > Should I report a new bug on bugzilla for mwdumper?
> >
> > Michael
> >
> >
> > On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis
> > <[email protected]> wrote:
> >
> > > great! at least we know what's causing it. I've seen the thread about
> > > xerces before but it was too old so I thought there is probably no relation.
> > >
> > > Let me know when there is a new build to try out or anything else I can do
> > > to help fix the problem.
> > >
> > > Michael
> > >
> > >
> > > On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn <[email protected]> wrote:
> > >
> > >> On 20-05-2013, Mon, at 13:18 +0200, Michael Tsikerdekis wrote:
> > >>
> > >> > 33 pages (0.593/sec), 25,374 revs (455.695/sec)
> > >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
> > >> >     at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
> > >> >     at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> > >> >     at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> > >> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> > >> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> > >> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> > >> >     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> > >> >     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> > >> >     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> > >> >     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> > >> >     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> > >> ...
> > >>
> > >> The file itself is fine; proof of that is that I isolated the
> > >> problematic page, removed the first revision (which had been processed
> > >> without problems) and then all remaining revisions including the 'bad'
> > >> one were handled properly.
> > >>
> > >> This is most likely a regression:
> > >> http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> > >> Our spec says to build against maven's xerces version 2.7.1, and I
> > >> expect that never got the patch [1]. I'm not sure what version of the
> > >> xerces library is good ([2]).
> > >>
> > >> I'm adding Chad back on the cc though since he'll have to update the
> > >> build specs. Chad, do you want a bugzilla report for this?
> > >>
> > >> Ariel
> > >>
> > >> [1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> > >> [2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697205#action_12697205

_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
