You can source that from the cirrussearch dumps, which contain the article
text already cleaned up. Each dump is a gzipped stream of newline-delimited
JSON in elasticsearch bulk format, so every document line is preceded by a
metadata line and the lines are read in pairs. The Python looks something like:
import codecs
import json
import zlib
from itertools import zip_longest

import requests


def get_gzip_stream(url):
    """Stream a gzipped url and yield decoded text chunks."""
    with requests.get(url, stream=True) as res:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # Incremental decoder: a multi-byte utf-8 character split across
        # two chunks would otherwise raise a UnicodeDecodeError.
        utf8 = codecs.getincrementaldecoder('utf-8')()
        for data in res.iter_content(chunk_size=64 * 1024):
            yield utf8.decode(d.decompress(data))
        # Flush whatever the decompressor still has buffered at end of stream.
        yield utf8.decode(d.flush(), final=True)


def decode_lines(stream):
    """Reassemble chunks into newline-delimited lines, parsing each as JSON."""
    buf = ''
    for data in stream:
        buf += data
        # A single chunk may carry several complete lines.
        while '\n' in buf:
            line, buf = buf.split('\n', 1)
            yield json.loads(line)
    if buf.strip():
        yield json.loads(buf)


def pair_up_lines(lines):
    """The dump alternates metadata and document lines; yield them as pairs."""
    return zip_longest(*([iter(lines)] * 2))


url = ('https://dumps.wikimedia.org/other/cirrussearch/20180723/'
       'enwiki-20180723-cirrussearch-content.json.gz')

stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])
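The full content dump is quite large, so for a quick look at what a cirrus
document actually carries you probably want to cap the stream rather than
print everything. A minimal sketch in place of the final loop above, reusing
the stream pipeline already built (only stdlib additions; field names other
than 'title' and 'text' vary by wiki and dump date):

from itertools import islice
from pprint import pprint

# Inspect only the first three (metadata, document) pairs.
for meta, doc in islice(stream, 3):
    print(meta['index']['_id'], doc['title'])
    pprint(sorted(doc.keys()))  # list every field this dump provides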
On Tue, Jul 24, 2018 at 7:23 AM Nikhil Prakash <[email protected]>
wrote:
> Hi There,
>
> I'm searching for an efficient way to convert the WikiText of the
> downloaded data dumps (in XML) to plain text. I basically need the plain
> text of each and every revision of Wikipedia articles.
>
> Therefore, it would be very helpful if you could tell me about some library
> or some piece of code (a bunch of regexes) to convert WikiText to plain text.
> BTW, I write my code in Python!
>
> Thanks.
_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l