Traversal help

R. Bryan Smith Sun, 02 Apr 2017 21:56:55 -0700

Hello,

I am working with Python 3.6.  I’ve been trying to figure out a solution to my 
question for about 40 hrs with no success and hundreds of failed attempts.  
Essentially, I have bitten off way more than I can chew with processing this 
file.  Most of what follows is my attempt to explain the situation as best I can.


I have a JSONL (new line) file that I downloaded using requests and the 
following code: 
with open(fname, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=1024):
        fd.write(chunk)

The file was served in gzip format (the API says the encoding is UTF-8) and 
saved to a Windows 8 machine (it’s current, but 8).
These files are rather large, roughly 4GB each.
I put the ‘shebang’ for ‘UTF-8’ at the top of my Python program: 
# -*- encoding: utf-8 -*-

After I save the file, I read it using this:
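import codecs
import json

def read_json(path):
    '''Turns a JSON Lines file (one object per line) into a list of objects'''
    temp_array = []
    # backslashreplace turns undecodable bytes into escapes instead of raising
    with codecs.open(path, 'r', 'utf-8', 'backslashreplace') as f:
        for line in f:
            line = line.strip()  # drop the line ending and stray whitespace
            if line:  # skip blank lines
                temp_array.append(json.loads(line))
    return temp_array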
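
Since the files are around 4GB, building one big list may not fit in memory. 
Below is a minimal sketch of reading the records lazily instead.  It assumes 
the saved file is still gzip-compressed on disk (if it was already 
decompressed, plain open() would take the place of gzip.open()).  The name 
iter_records and the sample file are made up for illustration.

```python
import gzip
import json
import os
import tempfile

def iter_records(path):
    """Yield one parsed object per non-blank line of a gzipped JSON Lines file."""
    # errors="backslashreplace" keeps undecodable bytes as escapes
    with gzip.open(path, "rt", encoding="utf-8", errors="backslashreplace") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny self-check with a made-up sample file (not the real data)
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1}\n\n{"id": 2, "name": "é"}\n')

records = list(iter_records(tmp_path))
```

Looping with "for record in iter_records(path):" processes one record at a 
time, so only the current line has to be held in memory.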

I am working on a Linux server to partition the list returned above.  There 
are three linked levels of detail (A, B, C) that can exist in any collection 
within the JSON, and each Collection can contain wildly varying and/or 
repeating fields.  The data contained within is scraped from websites all over 
the world.  I wanted to ‘traverse’ the file structure and found an algorithm 
that I think will work:

def traverser(obj, path=None, callback=None):
    if path is None:
        path = []
        
    if isinstance(obj, dict):
        value = {k: traverser(v, path+[k], callback)
                 for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverser(elem, path+[[]], callback)
                 for elem in obj]
    else:
        value = obj
    
    if callback is None:
        return value
    else:
        return callback(path, value)
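
As a rough illustration of how the callback argument could be used to 
‘collect’ values, here is a sketch that records every leaf value together with 
its path.  The traverser function is repeated only so the example runs on its 
own; make_collector, the dotted path format and the sample record are 
assumptions made up for the example.

```python
def traverser(obj, path=None, callback=None):
    # Same function as above, repeated so this sketch is self-contained
    if path is None:
        path = []
    if isinstance(obj, dict):
        value = {k: traverser(v, path + [k], callback) for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverser(elem, path + [[]], callback) for elem in obj]
    else:
        value = obj
    if callback is None:
        return value
    return callback(path, value)

def make_collector(found):
    """Hypothetical helper: build a callback that appends (path, leaf) pairs."""
    def collect(path, value):
        if not isinstance(value, (dict, list)):
            # An empty list in the path marks "some element of a list"
            dotted = ".".join("[]" if p == [] else str(p) for p in path)
            found.append((dotted, value))
        return value  # hand the value back so the structure is rebuilt unchanged
    return collect

record = {"A": {"B": [{"C": 1}, {"C": 2}]}, "name": "x"}  # made-up sample
found = []
traverser(record, callback=make_collector(found))
# found == [("A.B.[].C", 1), ("A.B.[].C", 2), ("name", "x")]
```

Because the path only marks list positions with [], repeated elements share 
the same path; an index would have to be tracked separately to tell them apart.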

The only problem, and the question that follows from it, is this:  I have yet 
to decode the data successfully.  How do I ‘collect’ each of these objects into 
some sort of container while I am traversing the JSON Lines collection 
(handling encoding errors), so that I can then write them to a csv file (in 
‘utf-8’) that won’t error out when I try to import it into an IBM ‘utf-8’ 
encoded DB?  Actually, after that, I would also like to learn how to grab a 
specific element, if present in each Collection, whenever I need it - but that 
can wait.
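
To make the goal concrete, here is a minimal sketch of the kind of result being 
asked for: each record flattened into one row, and the rows written as CSV with 
a header that is the union of all paths seen.  The helpers flatten and 
write_csv, the dotted column names and the sample records are assumptions for 
illustration only, not something taken from the real data.

```python
import csv
import io

def flatten(obj, prefix=""):
    """Hypothetical helper: flatten nested dicts/lists into {dotted_path: leaf}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become numeric path parts
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        child = "{}.{}".format(prefix, key) if prefix else str(key)
        flat.update(flatten(value, child))
    return flat

def write_csv(records, fileobj):
    """Write flattened records to fileobj; missing fields are left empty."""
    rows = [flatten(r) for r in records]
    # Collections vary, so the header is the union of every path seen
    fieldnames = sorted({name for row in rows for name in row})
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames

records = [{"A": {"B": [{"C": 1}]}}, {"A": {"D": "é"}, "name": "x"}]  # made up
buffer = io.StringIO()
fieldnames = write_csv(records, buffer)
```

For a real file, the StringIO buffer would be replaced by something like 
open(out_path, 'w', encoding='utf-8', newline=''), which writes the CSV as 
UTF-8.  Note that this sketch holds every row in memory to work out the header, 
so for 4GB inputs a two-pass approach would probably be needed.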

I’ve tried using the JSON module on the JSONL file, but the structure is really 
complicated and changing, with lots of different control and spacing 
characters, in addition to some odd (potentially non-unicode) characters.  
Here’s the schema: http://json-schema.org/fraft-04/schema# 

I’m not a programmer, but I am learning through assimilation.  Any help is 
greatly appreciated, even if it’s just pointing me to documentation that can 
help me learn what to consider and lead me to what to do.  

Thank you,
R. Smith

-- 
https://mail.python.org/mailman/listinfo/python-list
