Hello,
I am working with Python 3.6. I’ve been trying to figure out a solution to my
question for about 40 hrs with no success and hundreds of failed attempts.
Essentially, I have bitten off way more than I can chew with processing this
file. Most of what follows is my attempt to explain the situation as best I can.
I have a JSON Lines (newline-delimited JSON) file that I downloaded using
requests and the following code:
    with open(fname, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=1024):
            fd.write(chunk)
The file was served in gzip format (the API says the encoding is UTF-8) and I
downloaded it to a Windows 8 machine (it’s up to date, but still Windows 8).
These files are rather large, roughly 4 GB each.
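Because the saved file may or may not still be gzip-compressed on disk (requests can decompress some responses transparently), here is a minimal sketch of opening it as text either way. The names open_text and GZIP_MAGIC are hypothetical, made up for illustration; only gzip.open and the built-in open are standard.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def open_text(path, encoding="utf-8", errors="backslashreplace"):
    """Open path as text, decompressing on the fly if it holds gzip data."""
    # Peek at the first two bytes instead of trusting the file extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding=encoding, errors=errors)
    return open(path, "r", encoding=encoding, errors=errors)
```

Either branch returns a file object that can be iterated line by line, so a 4 GB file never has to be held in memory all at once.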
I put the UTF-8 encoding declaration (what I had been calling the ‘shebang’) at
the top of my Python program: # -*- encoding: utf-8 -*-
After I save the file, I read it using this:
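    import json

    def read_json(path):
        '''Turns a JSON Lines file (one object per line) into a list of objects'''
        temp_array = []
        # Built-in open() splits lines only on \n, \r and \r\n; codecs.open()
        # also splits on characters such as U+2028, which can cut a record in two.
        with open(path, 'r', encoding='utf-8', errors='backslashreplace') as f:
            for line in f:
                # strip() drops the line ending; the old '\n|\r' also stripped '|'
                record = json.loads(line.strip())
                temp_array.append(record)
        return temp_array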
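With files of about 4 GB, building one big list may exhaust memory, and a single malformed line aborts the whole read. A minimal sketch of a streaming alternative follows, assuming it is acceptable to skip lines that do not parse. The name iter_records is hypothetical; json.loads and the ValueError it raises are standard.

```python
import json
import sys


def iter_records(lines):
    """Yield (line_number, record) for every line that parses as JSON.

    `lines` is any iterable of text lines, for example an open file object.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            # Report the bad line and keep going instead of aborting.
            print("skipping line %d: %s" % (line_number, exc), file=sys.stderr)
            continue
        yield line_number, record
```

Used as `for number, record in iter_records(f): ...`, this handles one record at a time, and the reported line numbers show where the problem data sits.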
I am working on a Linux server to partition the list returned above. There
are three linked levels of detail (A, B, C) that can exist in any collection
within the JSON, and each collection can contain wildly varying and/or
repeating fields. The data contained within is scraped from websites all over
the world.
I wanted to ‘traverse’ the file structure and found an algorithm that I think
will work:
    def traverser(obj, path=None, callback=None):
        if path is None:
            path = []
        if isinstance(obj, dict):
            value = {k: traverser(v, path+[k], callback)
                     for k, v in obj.items()}
        elif isinstance(obj, list):
            value = [traverser(elem, path+[[]], callback)
                     for elem in obj]
        else:
            value = obj
        if callback is None:
            return value
        else:
            return callback(path, value)
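As an illustration of the ‘collecting’ step, here is a minimal sketch of a variation on the traversal above that gathers every leaf value into one flat dict per record. It is not the traverser function itself. The names flatten and flatten_record are hypothetical, and using the list index in the path (rather than the [] placeholder used above) is an assumption made so that repeated elements get distinct keys.

```python
def flatten(obj, path=()):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, path + (str(key),))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            # Assumption: the list position becomes part of the path.
            yield from flatten(value, path + (str(index),))
    else:
        yield path, obj  # a leaf: str, number, bool or None


def flatten_record(record, sep="."):
    """Return one flat dict for a record, keyed by dotted paths."""
    # Note: empty dicts and empty lists have no leaves, so they vanish here.
    return {sep.join(path): value for path, value in flatten(record)}
```

Once a record is flat, fetching one element if it is present comes down to something like `flat.get("A.B.0.C")`, which returns None when the path does not exist in that record.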
The only problem, and the question that follows from it, is this: I have yet
to decode the data successfully. How do I ‘collect’ each of these objects into
some sort of container while I am traversing the JSON Lines collection
(handling encoding errors along the way), so that I can then write them to a
CSV file that is UTF-8 encoded and won’t error out when I try to import it into
an IBM UTF-8 encoded DB? After that, I would also like to learn how to grab a
specific element, if present in each collection, whenever I need it, but that
can wait.
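For the CSV step, here is a minimal sketch assuming each record has already been reduced to a flat dict and that the rows fit in memory as a list. The name write_csv is hypothetical; csv.DictWriter and its restval parameter are standard. Because records carry varying fields, the header is taken as the union of all keys.

```python
import csv


def write_csv(rows, path):
    """Write a list of flat dicts with differing keys to a UTF-8 CSV file."""
    # The header is the union of all keys, in the order first seen.
    fieldnames = []
    seen = set()
    for row in rows:
        for key in row:
            if key not in seen:
                seen.add(key)
                fieldnames.append(key)
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as fd:
        # restval fills in fields that a given record does not have.
        writer = csv.DictWriter(fd, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

The csv module quotes fields that contain commas, quotes or line breaks, so those characters should not break the column layout. Whether the database importer accepts embedded line breaks is a separate question that this sketch does not answer.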
I’ve tried using the json module on the JSONL file, but the structure is really
complicated and changing, with lots of different control and spacing
characters, in addition to some odd (potentially non-Unicode) characters.
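To illustrate one way of dealing with stray control characters, here is a minimal sketch that removes them from string values before they are written out. It assumes that dropping them, rather than escaping them, is acceptable for the import. The name clean_text and its keep parameter are hypothetical; unicodedata.category and the "Cc" (control) category are standard.

```python
import unicodedata


def clean_text(value, keep="\t\n"):
    """Drop control characters from a string; return other values unchanged."""
    if not isinstance(value, str):
        return value  # numbers, booleans and None pass through untouched
    return "".join(
        ch for ch in value
        # "Cc" covers U+0000..U+001F and U+007F..U+009F.
        if ch in keep or unicodedata.category(ch) != "Cc"
    )
```

Accented letters and other ordinary non-ASCII text are left alone, since they are not in the control category.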
Here’s the schema: http://json-schema.org/draft-04/schema#
I’m not a programmer, but I am learning through assimilation. Any help is
greatly appreciated. Even if it’s pointing me to documentation that can help
me learn what to consider and lead me to what to do.
Thank you,
R. Smith
--
https://mail.python.org/mailman/listinfo/python-list