You can source that from the cirrussearch dumps, which contain the article
text already cleaned up. Each dump is a gzipped stream of newline-delimited
JSON in elasticsearch bulk format, so every document line is preceded by a
metadata line and the lines are read in pairs. The Python looks something like:
import codecs
import json
import zlib
from itertools import zip_longest

import requests


def get_gzip_stream(url):
    """Stream a gzipped url and yield decoded text chunks."""
    with requests.get(url, stream=True) as res:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # Incremental decoder: a multi-byte utf-8 character split across
        # two chunks would otherwise raise a UnicodeDecodeError.
        utf8 = codecs.getincrementaldecoder('utf-8')()
        for data in res.iter_content(chunk_size=64 * 1024):
            yield utf8.decode(d.decompress(data))
        # Flush whatever the decompressor still has buffered at end of stream.
        yield utf8.decode(d.flush(), final=True)


def decode_lines(stream):
    """Reassemble chunks into newline-delimited lines, parsing each as JSON."""
    buf = ''
    for data in stream:
        buf += data
        # A single chunk may carry several complete lines.
        while '\n' in buf:
            line, buf = buf.split('\n', 1)
            yield json.loads(line)
    if buf.strip():
        yield json.loads(buf)


def pair_up_lines(lines):
    """The dump alternates metadata and document lines; yield them as pairs."""
    return zip_longest(*([iter(lines)] * 2))


url = ('https://dumps.wikimedia.org/other/cirrussearch/20180723/'
       'enwiki-20180723-cirrussearch-content.json.gz')

stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])
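The full content dump is quite large, so for a quick look at what a cirrus
document actually carries you probably want to cap the stream rather than
print everything. A minimal sketch in place of the final loop above, reusing
the stream pipeline already built (only stdlib additions; field names other
than 'title' and 'text' vary by wiki and dump date):

from itertools import islice
from pprint import pprint

# Inspect only the first three (metadata, document) pairs.
for meta, doc in islice(stream, 3):
    print(meta['index']['_id'], doc['title'])
    pprint(sorted(doc.keys()))  # list every field this dump provides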
On Tue, Jul 24, 2018 at 7:23 AM Nikhil Prakash <[email protected]>
wrote:
> Hi There,
>
> I'm searching for an efficient way to convert the WikiText of the
> downloaded data dumps (in XML) to plain text. I basically need the plain
> text of each and every revision of Wikipedia articles.
>
> Therefore, it would be very helpful if you could tell me about some library
> or some piece of code (a bunch of regexes) to convert WikiText to plain text.
> BTW, I write my code in Python!
>
> Thanks.
_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l