Marek Marczykowski-Górecki wrote on Tue, Sep 18, 2018 at 20:17:03 +0200: > File "/usr/lib/python3/dist-packages/diffoscope/comparators/json.py", > line 52, in recognizes > f.read().decode('utf-8', errors='ignore'), > MemoryError > > The JSONFile.recognizes function, for context: > > @classmethod > def recognizes(cls, file): > with open(file.path, 'rb') as f: > # Try fuzzy matching for JSON files > is_text = any( > file.magic_file_type.startswith(x) > for x in ('ASCII text', 'UTF-8 Unicode text'), > ) > if is_text and not file.name.endswith('.json'): > buf = f.read(10) > if not any(x in buf for x in b'{['): > return False > f.seek(0) > > try: > file.parsed = json.loads( > f.read().decode('utf-8', errors='ignore'), > object_pairs_hook=collections.OrderedDict, > )
Slurping the file to a string object is an antipattern. Instead of using f.read() to create a 4.5GB string, it would be better to use json.load(f) to read the file incrementally. That should raise an exception rather quickly. > except ValueError: > return False > > return True > Obviously ISO file is not JSON. > The whole thing could be avoided if earlier check (if initial 10 chars > contains '[' or '{') would be executed not only on "text" files. > Any reasons for that "is_text" there? Alternatively, if is_text=False, > maybe the function should return False early? > > I can provide a patch for either option, but I'd like to know which one > of them you prefer. No opinion on these. Thanks for including the function in the report. Cheers, Daniel