[Numpy-discussion] Proposal - Making ndarray object JSON serializable via standardized JData annotations

2021-11-25 Thread Qianqian Fang

Dear numpy developers,

I would like to share a proposal on making ndarray JSON serializable by 
default, as detailed in this github issue:


https://github.com/numpy/numpy/issues/20461


Briefly, my group and collaborators are working on a new NIH (National 
Institutes of Health) funded initiative - NeuroJSON (http://neurojson.org) - 
to further disseminate a lightweight data annotation specification (JData 
<https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md>) 
among the broad neuroimaging/scientific community. Python and numpy have 
been widely used <http://neuro.debian.net/_files/nipy-handout.pdf> in 
neuroimaging data analysis pipelines (nipy, nibabel, mne-python, PySurfer, 
...), because the N-D array is the most important data structure in 
scientific data. However, numpy currently does not support JSON 
serialization by default. This is one of the frequently requested 
features on github (#16432, #12481).


We have developed lightweight Python modules (jdata 
<https://pypi.org/project/jdata/>, bjdata 
<https://pypi.org/project/bjdata/>) to help export/import ndarray 
objects to/from JSON (and a binary JSON format - BJData 
<https://github.com/NeuroJSON/bjdata/blob/master/Binary_JData_Specification.md>/UBJSON 
<http://ubjson.org/> - to gain efficiency). The approach is to convert 
ndarray objects to a dictionary with subfields using standardized JData 
annotation tags. The JData spec can serialize complex data structures 
such as N-D arrays (solid, sparse, complex), trees, graphs, tables, etc. 
It also permits data compression. These annotations have been 
implemented in my MATLAB toolbox - JSONLab 
<https://github.com/fangq/jsonlab> - since 2011 to help import/export 
MATLAB data types, and have been broadly used among MATLAB/GNU Octave users.


Examples of these portable JSON annotation tags representing N-D arrays 
can be found at


http://openjdata.org/wiki/index.cgi?JData/Examples/Basic#2_D_arrays_in_the_annotated_format
http://openjdata.org/wiki/index.cgi?JData/Examples/Advanced

and the detailed formats on N-D array annotations can be found in the spec:

https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#annotated-storage-of-n-d-arrays


Our current Python functions to encode/decode ndarray objects to/from 
JSON-serializable forms are quite compact (handling lossless type/data 
conversion and data compression):


https://github.com/NeuroJSON/pyjdata/blob/63301d41c7b97fc678fa0ab0829f76c762a16354/jdata/jdata.py#L72-L97
https://github.com/NeuroJSON/pyjdata/blob/63301d41c7b97fc678fa0ab0829f76c762a16354/jdata/jdata.py#L126-L160
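For context, here is a minimal round-trip sketch using the jdata module linked 
above (jd.save/jd.load and the .json-suffix behavior are the same calls used 
later in this thread; the file name is just an example):

# pip install jdata   -- a small array round-tripped through JData-annotated JSON
import numpy as np
import jdata as jd

a = np.arange(6, dtype=np.float32).reshape(2, 3)
jd.save(a, 'demo.json')       # .json/.jdt suffix -> text-based JSON output
b = jd.load('demo.json')      # decode the annotations back to an ndarray
print(np.array_equal(a, b))   # lossless round-trip expected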

We strongly believe that enabling JSON serialization by default will 
benefit the numpy user community, making it a lot easier to share 
complex data between platforms (MATLAB/Python/C/FORTRAN/JavaScript...) 
via a standardized/NIH-backed data annotation scheme.


We are happy to hear your thoughts, suggestions on how to contribute, 
and also glad to set up dedicated discussions.


Cheers

Qianqian


[Numpy-discussion] Re: Proposal - Making ndarray object JSON serializable via standardized JData annotations

2021-11-25 Thread Qianqian Fang

On 11/25/21 17:05, Stephan Hoyer wrote:

Hi Qianqian,

What is your concrete proposal for NumPy here?

Are you suggesting new methods or functions like to_json/from_json in 
NumPy itself?



That would work - either define a subclass of JSONEncoder to serialize 
ndarray and allow users to pass it to cls in json.dump, or, as you 
mentioned, define to_json/from_json like pandas DataFrame 
<https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html>; 
either would save people from writing custom code/formats.
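As a rough illustration (not an existing numpy or jdata API), such an encoder 
hook could look something like the sketch below, using the JData-style tag 
names shown later in this thread; a real implementation would also need the 
matching decoder and a proper dtype-name mapping:

import json
import numpy as np

class NDArrayEncoder(json.JSONEncoder):
    """Sketch: serialize ndarray objects as JData-style annotated dicts."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {
                "_ArrayType_": str(obj.dtype),      # simplified type naming
                "_ArraySize_": list(obj.shape),
                "_ArrayData_": obj.ravel().tolist(),
            }
        return super().default(obj)

x = np.arange(6).reshape(2, 3)
print(json.dumps({"x": x}, cls=NDArrayEncoder))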


I am also wondering if there is a more automated way to tell 
json.dump/dumps to use a default serializer for ndarray without passing 
cls=...? I saw an SO post that mentioned a "__serialize__" method on a 
class, but I can't find it in the official docs. Is anyone aware of a 
method for defining a default JSON serializer on an object?



As far as I can tell, reading/writing in your custom JSON format 
already works with your jdata library.



Ideally, I was hoping the small jdata encoder/decoder functions could be 
integrated into numpy; that would help avoid the "TypeError: Object of type 
ndarray is not JSON serializable" in json.dump/dumps without needing 
additional modules; more importantly, it simplifies the user experience 
when exchanging complex arrays (complex-valued, sparse, special shapes) 
with other programming environments.


Qianqian




Best,
Stephan


[Numpy-discussion] Re: Proposal - Making ndarray object JSON serializable via standardized JData annotations

2021-11-25 Thread Qianqian Fang

On 11/25/21 23:00, Robert Kern wrote:
We could also provide a JSONEncoder/JSONDecoder pair, too, but as I 
mention in one of the Github issues you link to, there are a number of 
different expectations that people could have for what the JSON 
representation of an array is. Some will want to use the JData 
standard. Others might just want the arrays to be represented as lists 
of lists of plain-old JSON numbers in order to talk with software in 
other languages that have no particular standard for array data.



hi Robert

I agree with you that different users have different expectations, but 
if desired, this can be accommodated by defining slightly different 
(built-in) subclasses of JSONEncoder and telling users what to expect 
when passing each via cls=, or by using one JSONEncoder with different 
parameters.


Other projects like pandas.DataFrame handle this via a format parameter 
("orient") to the to_json() function:


https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
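For example (standard pandas API, shown here only to illustrate the per-call 
format switch):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
print(df.to_json(orient="split"))    # {"columns":[...],"index":[...],"data":[...]}
print(df.to_json(orient="records"))  # [{"a":1,"b":3.0},{"a":2,"b":4.0}]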


from my limited experience, as long as the "TypeError" from json keeps 
popping up, requests like this ( #12481, #16432, #18994, 
pallets/flask#4012, openmm/openmm#3202, zarr-developers/zarr-python#354) 
will unlikely to cease (and maintainers will have to keep on closing 
with "wontfix") - after all, no matter how different the format 
expectations a user may have, seeing some sort of default behavior is 
still a lot more satisfying than seeing an error.



It seems to me that the jdata package is the right place for 
implementing the JData standard. I'm happy for our documentation to 
point to it in all the places that we talk about serialization of 
arrays. If the json module did have some way for us to specify a 
default representation for our objects, then that would be a different 
matter. But for the present circumstances, I'm not seeing a 
substantial benefit to moving this code inside of numpy. Outside of 
numpy, you can evolve the JData standard at its own pace.


--
Robert Kern




I appreciate that you are willing to add this to the documentation, that 
is totally fine - I will just leave the links/resources here in case 
solving this issue becomes a priority in the future.



Qianqian





[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Qianqian Fang
I am curious what you and other developers think about adopting 
JSON/binary JSON as a similarly simple, reverse-engineerable, but 
universally parsable array exchange format, instead of designing another 
numpy-specific binary format.


I am interested in this topic (as well as in thoughts among numpy 
developers) because I am currently working on a project - NeuroJSON 
(https://neurojson.org) - funded by the US National Institutes of Health. 
The goal of the NeuroJSON project is to create easy-to-adopt, 
easy-to-extend, and preferably human-readable data formats to help 
disseminate and exchange neuroimaging data (and scientific data in 
general).


Needless to say, numpy is a key toolkit that is widely used among 
neuroimaging data analysis pipelines. I've seen discussions of 
potentially adopting npy as a standardized way to share volumetric data 
(as ndarrays), such as in this thread


https://github.com/bids-standard/bids-specification/issues/197

however, several limitations were also discussed, for example:

1. npy only supports a single numpy array and does not support other 
metadata or more complex data records (multiple arrays can only be 
achieved via multiple files)

2. no internal (i.e. data-level) compression, only file-level compression
3. although the file format is simple, it still requires a parser to 
read/write, and such parsers are not widely available in other 
environments, making it mostly limited to exchanging data among python 
programs
4. I am not entirely sure, but I suppose it does not support sparse 
matrices or special matrices (such as diagonal/band/symmetric etc.) - I 
could be wrong though


In the NeuroJSON project, we primarily use JSON and binary JSON 
(specifically, the UBJSON <http://ubjson.org/> derived BJData 
<https://github.com/NeuroJSON/bjdata/blob/master/Binary_JData_Specification.md> 
format) as the underlying data exchange files. Through standardized data 
annotations 
<https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md>, 
we are able to address most of the above limitations - the generated 
files are universally parsable in nearly all programming environments 
with existing parsers, support complex hierarchical data and compression, 
and can readily benefit from the large ecosystem of JSON (JSON Schema, 
JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL databases, ...).


I understand that simplicity is a key design spec here. I want to 
highlight UBJSON/BJData as a competitive alternative format. It is also 
designed with simplicity in mind, yet it allows storing hierarchical, 
strongly-typed, complex binary data and is easily extensible.


A UBJSON/BJData parser is not necessarily longer than an npy parser; for 
example, the Python reader for the full spec only takes about 500 lines 
of code (including comments), and similarly for a JS parser:


https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
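For what it's worth, a minimal usage sketch of the Python package (this 
assumes the bjdata package keeps the dumpb/loadb entry points of the 
py-ubjson package it was derived from - please check the package docs):

# pip install bjdata  -- API assumed to mirror py-ubjson (dumpb/loadb)
import bjdata

buf = bjdata.dumpb({"node": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                    "face": [[1, 2, 3]]})   # encode to binary JSON bytes
obj = bjdata.loadb(buf)                      # decode back to Python objects
print(obj["face"])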

We actually did a benchmark a few months back - the test workloads are 
two large 2D numerical arrays (node, face, storing surface mesh data), 
and we compared the parsing speed of various formats in Python, MATLAB, 
and JS. The uncompressed BJData (BMSHraw) reported a loading speed nearly 
as fast as reading a raw binary dump, and the internally compressed 
BJData (BMSHz) gave the best balance between small file sizes and loading 
speed; see our results here


https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large

I want to add two quick points to echo the features you desired in npy:

1. it is not common to use mmap when reading JSON/binary JSON files, but 
it is certainly possible. I recently wrote a JSON-mmap spec and a MATLAB 
reference implementation

2. UBJSON/BJData natively supports appendable root-level records; JSON 
has been extensively used in data streaming with appendable NDJSON 
(newline-delimited JSON) or concatenated JSON 
(https://en.wikipedia.org/wiki/JSON_streaming)



Just a quick comparison of output file sizes with a 1000x1000 identity 
(unit diagonal) matrix:


# python3 -m pip install jdata bjdata
import numpy as np
import jdata as jd
x = np.eye(1000)               # create a large array
y = np.vsplit(x, 5)            # split into smaller chunks
np.save('eye5chunk.npy', y)    # save npy
jd.save(y, 'eye5chunk_bjd_raw.jdb')                            # save as uncompressed bjd
jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'})   # zlib-compressed bjd
jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'})   # lzma-compressed bjd

newy = jd.load('eye5chunk_bjd_zlib.jdb')   # loading/decoding
newx = np.concatenate(newy)                # regroup chunks
newx.dtype


here are the output file sizes in bytes:

8000128  eye5chunk.npy
5004297  eye5chunk_bjd_raw.jdb
  10338  eye5chunk_bjd_zlib.jdb
   2206  eye5chunk_bjd_lzma.jdb

[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Qianqian Fang

On 8/25/22 12:25, Robert Kern wrote:
No one is really proposing another format, just a minor tweak to the 
existing NPY format.


Agreed. I was just following up on the previous comment about alternative 
formats (such as HDF5) and the pros/cons of npy.



I don't quite know what this means. My installed version of `jq`, for 
example, doesn't seem to know what to do with these files.


❯ jq --version
jq-1.6

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38



The .jdb files are binary JSON files (specifically BJData) that jq does 
not currently support; to save as text-based JSON, you just change the 
suffix to .json or .jdt - this results in a ~33% size increase compared 
to the binary form due to base64:


jd.save(y, 'eye5chunk_bjd_zlib.jdt', {'compression':'zlib'});

13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb

jq . eye5chunk_bjd_zlib.jdt

[
  {
    "_ArrayType_": "double",
    "_ArraySize_": [
      200,
      1000
    ],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [
      1,
      200000
    ],
    "_ArrayZipData_": "..."
  },
  ...
]


I think a fundamental problem here is that it looks like each element 
in the array is delimited. I.e. a `float64` value starts with b'D' 
then the 8 IEEE-754 bytes representing the number. When we're talking 
about memory-mappability, we are talking about having the on-disk 
representation being exactly what it looks like in-memory, all of the 
IEEE-754 floats contiguous with each other, so we can use the 
`np.memmap` `ndarray` subclass to represent the on-disk data as a 
first-class array object. This spec lets us mmap the binary JSON file 
and manipulate its contents in-place efficiently, but that's not what 
is being asked for here.



There are several BJData-compliant forms to store the same binary array 
losslessly. The most memory-efficient and disk-mmapable (but not 
necessarily the most disk-efficient) form is the ND-array container 
syntax that the BJData spec extended over UBJSON.


For example, a 100x200x300 3D float64 ($D) array can be stored as below 
(numbers are stored in binary form, white spaces should be removed):

[$D #[$u#U3 100 200 300 value0 value1 ...

where the "value_i"s are a contiguous (row-major) binary stream of the 
float64 buffer without the per-element marker ('D'), because it is 
absorbed into the optimized header of the array "[" following the type 
"$" marker. The data chunk is mmap-able, although if you desire a 
pre-determined initial offset, you can force the dimension vector 
(#[$u #U 3 100 200 300) to use an integer type ($u) large enough, for 
example uint32 (m); then the starting offset of the binary stream will be 
entirely predictable.
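To make the layout concrete, here is a rough sketch of mine (not an official 
encoder) that writes a small float64 array in this ND-array form - assuming 
little-endian storage per the BJData spec and uint16 ('u') dimensions - and 
then memory-maps the payload at a fixed offset:

import struct
import numpy as np

a = np.arange(24, dtype='<f8').reshape(2, 3, 4)

# '[' '$' 'D' '#' + dimension vector: an optimized uint16 ('u') array
# with a uint8 ('U') count, i.e. "[$u#U<ndim> <dims...>"
header = (b'[$D#'
          + b'[$u#U' + struct.pack('<B', a.ndim)
          + struct.pack('<%dH' % a.ndim, *a.shape))

with open('demo_nd.bjd', 'wb') as f:
    f.write(header)
    f.write(a.tobytes(order='C'))   # contiguous row-major float64 payload

# the payload is mmap-able at a fixed offset of len(header) bytes
m = np.memmap('demo_nd.bjd', dtype='<f8', mode='r',
              offset=len(header), shape=a.shape)
print(np.array_equal(m, a))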


Multiple ND arrays can be directly appended at the root level, for example:

[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...

can store 100x200x300 chunks of a 400x200x300 array.

Alternatively, one can also use the annotated format (in JSON form: 
{"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]}) 
to store everything in a single contiguous 1D buffer:

{U11 _ArrayType_ S U6 double U11 _ArraySize_ [$u#U3 100 200 300 U11 
_ArrayData_ [$D #m 6000000 value1 value2 ...}


The contiguous buffer in the _ArrayData_ section is also disk-mmap-able; 
you can also impose additional requirements on the array metadata to 
ensure a predictable initial offset, if desired.


similarly, these annotated chunks can be appended in either JSON or 
binary JSON forms, and the parsers can automatically handle both forms 
and convert them into the desired binary ND array with the expected type 
and dimensions.




here are the output file sizes in bytes:

8000128  eye5chunk.npy
5004297  eye5chunk_bjd_raw.jdb

Just a note: This difference is solely due to a special representation 
of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 
0.0 as a special value and uses the `float32` encoding of it). If you 
had any other value making up the bulk of the file, this would be 
larger than the NPY due to the additional delimiter b'D'.



The two BJData forms that I mentioned above (ND-array syntax or annotated 
array) will preserve the original precision/shape in round-trips. BJData 
follows the recommendation of the UBJSON spec and automatically reduces 
data size only when there is no precision loss (such as for integers or 
zeros), but this behavior is optional.




  10338  eye5chunk_bjd_zlib.jdb
   2206  eye5chunk_bjd_lzma.jdb

Qianqian

--
Robert Kern


[Numpy-discussion] Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-25 Thread Qianqian Fang
To avoid derailing the other thread on extending .npy files, I am going 
to start a new thread on alternative array storage file formats using 
binary JSON - in case there is such a need and interest among numpy users.


Specifically, I want to first follow up on Bill's question below 
regarding loading time.



On 8/25/22 11:02, Bill Ross wrote:


Can you give load times for these?



As I mentioned in the earlier reply to Robert, the most memory-efficient 
(i.e. fast-loading, disk-mmap-able), but not necessarily disk-efficient 
(i.e. it may result in the largest data file sizes), BJData construct to 
store an ND array is BJData's ND-array container.


I have to admit that both the jdata and bjdata modules have not been 
extensively optimized for speed. With the current implementation, here 
are the loading times for a larger diagonal matrix (eye(10000)).

A BJData file storing a single eye(10000) array using the ND-array 
container can be downloaded from here (file size: 1 MB with zip; if 
decompressed, it is ~800 MB, like the npy file) - this file was generated 
from a MATLAB encoder, but can be loaded using Python (see below, Re 
Robert).


800000128  eye1e4.npy
800000014  eye1e4_bjd_raw_ndsyntax.jdb
   813721  eye1e4_bjd_zlib.jdb
   113067  eye1e4_bjd_lzma.jdb

the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy 
1.19.5) for each file is listed below:


0.179s  eye1e4.npy (mmap_mode=None)
0.001s  eye1e4.npy (mmap_mode=r)
0.718s  eye1e4_bjd_raw_ndsyntax.jdb
1.474s  eye1e4_bjd_zlib.jdb
0.635s  eye1e4_bjd_lzma.jdb


Clearly, mmapped loading is the fastest option, unsurprisingly; it is 
true that the raw bjdata file is about 5x slower to load than npy, but 
given that the main chunk of the data is stored identically (as a 
contiguous buffer), I suppose that with some optimization of the decoder 
the gap between the two can be substantially shortened. The longer 
loading times of zlib/lzma (and similarly the saving times) reflect a 
trade-off between smaller file sizes and time for 
compression/decompression/disk IO.


By no means am I saying that the binary JSON format is ready to deliver 
better speed with its current non-optimized implementation. I just want 
to bring attention to this class of formats, and highlight that its 
flexibility gives abundant mechanisms to create fast, disk-mapped IO, 
while allowing additional benefits such as compression and unlimited 
metadata for future extensions.




> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb

For my case, I'd be curious about the time to add one 1T-entries file 
to another.



As I mentioned in the previous reply, BJData is appendable, so you can 
simply append another array (or a slice) to the end of the file.
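As a rough sketch of what appending could look like at the byte level (my own 
helper, not part of the jdata/bjdata API, reusing the ND-array layout 
discussed in the other thread):

import struct
import numpy as np

def bjd_float64_record(a):
    """Encode one float64 ndarray as a '[$D#[$u#U<ndim> <dims> <values>'
    BJData ND-array record (little-endian, uint16 dimensions)."""
    a = np.ascontiguousarray(a, dtype='<f8')
    header = (b'[$D#[$u#U' + struct.pack('<B', a.ndim)
              + struct.pack('<%dH' % a.ndim, *a.shape))
    return header + a.tobytes()

# appending another chunk is just writing one more root-level record
with open('chunks.jdb', 'ab') as f:
    f.write(bjd_float64_record(np.eye(1000)[:200]))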




Thanks,
Bill




also related, Re @Robert's question below

Are any of them supported by a Python BJData implementation? I didn't 
see any option to get that done in the `bjdata` package you 
recommended, for example.

https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200


The bjdata module currently only supports the ND-array syntax in the 
decoder (i.e. mapping such a buffer to a numpy.ndarray); it should be 
relatively trivial to add it to the encoder as well, though.


On the other hand, the annotated format is currently supported: one can 
call the jdata module (responsible for annotation-level 
encoding/decoding), as shown in my sample code, which then calls bjdata 
internally for data serialization.



Okay. Given your wording, it looked like you were claiming that the 
binary JSON was supported by the whole ecosystem. Rather, it seems 
like you can either get binary encoding OR the ecosystem support, but 
not both at the same time.


All in relative terms, of course - JSON has ~100 parsers listed on its 
website, MessagePack - another flavor of binary JSON - lists ~50-60 
parsers, and UBJSON lists ~20 parsers. I am not familiar with npy 
parsers, but googling returns only a few.


Also, most binary JSON implementations provide tools to convert to JSON 
and back, so, in that sense, whatever JSON has in its ecosystem can 
"potentially" be used for binary JSON files if one wants to. There are 
also recent publications comparing the differences between various binary 
JSON formats, in case anyone is interested:


https://github.com/ubjson/universal-binary-json/issues/115

[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-25 Thread Qianqian Fang

On 8/25/22 18:33, Neal Becker wrote:




the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9,
numpy 1.19.5) for each file is listed below:

0.179s  eye1e4.npy (mmap_mode=None)
0.001s  eye1e4.npy (mmap_mode=r)
0.718s  eye1e4_bjd_raw_ndsyntax.jdb
1.474s  eye1e4_bjd_zlib.jdb
0.635s  eye1e4_bjd_lzma.jdb


clearly, mmapped loading is the fastest option without a
surprise; it is true that the raw bjdata file is about 5x slower
than npy loading, but given the main chunk of the data are stored
identically (as contiguous buffer), I suppose with some
optimization of the decoder, the gap between the two can be
substantially shortened. The longer loading time of zlib/lzma
(and similarly saving times) reflects a trade-off between smaller
file sizes and time for compression/decompression/disk-IO.


I think the load time for mmap may be deceptive, it isn't actually
loading anything, just mapping to memory.  Maybe a better
benchmark is to actually process the data, e.g., find the mean
which would require reading the values.



Yes, that is correct; I meant to mention it wasn't an apples-to-apples 
comparison.


The loading times for fully loading the data and printing the mean, 
obtained by running the line below,

t=time.time(); newy=jd.load('eye1e4_bjd_raw_ndsyntax.jdb'); print(np.mean(newy)); t1=time.time() - t; print(t1)


are summarized below (I also added an lz4-compressed BJData/.jdb file via 
jd.save(..., {'compression':'lz4'}))


0.236s  eye1e4.npy (mmap_mode=None)                               - size: 800000128 bytes
0.120s  eye1e4.npy (mmap_mode=r)
0.764s  eye1e4_bjd_raw_ndsyntax.jdb (with C extension _bjdata in sys.path) - size: 800000014 bytes
0.599s  eye1e4_bjd_raw_ndsyntax.jdb (without C extension _bjdata)
1.533s  eye1e4_bjd_zlib.jdb         (without C extension _bjdata) - size: 813721 bytes
0.697s  eye1e4_bjd_lzma.jdb         (without C extension _bjdata) - size: 113067 bytes
0.918s  eye1e4_bjd_lz4.jdb          (without C extension _bjdata) - size: 3371487 bytes



The mmapped loading remains the fastest, but the run time is more 
realistic. I thought the lz4 compression would offer much faster 
decompression, but for this particular workload, that isn't the case.


It is also interesting to see that bjdata's C extension 
<https://github.com/NeuroJSON/pybj/tree/master/src> did not help when 
parsing a single large array compared to the native Python parser, 
suggesting room for further optimization.


Qianqian


[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-27 Thread Qianqian Fang
time for reducing with jdb (lzma): 0.204s (3.66 GB/s)
time for reducing with blosc2 (blosclz): 0.043s (17.2 GB/s)
time for reducing with blosc2 (zstd): 0.072s (10.4 GB/s)
Total sum: 1.0

In this case we can notice that the combination of blosc2+blosclz 
achieves speeds that are faster than using a plain numpy array.  
Having disk I/O going faster than memory is strange enough, but if we 
take into account that these arrays compress extremely well (more than 
1000x in this case), then the I/O overhead is really low compared with 
the cost of computation (all the decompression takes place in CPU 
cache, not memory), so in the end, this is not that surprising.


Cheers!





--
Francesc Alted



[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-31 Thread Qianqian Fang

On 8/30/22 06:29, Francesc Alted wrote:


Not exactly. What we've done is to encode the header and the trailer 
(i.e. where the metadata is) of the frame with msgpack. The chunks 
section is where the actual data is; this section does not follow a 
msgpack structure as such, but it is rather a sequence of data chunks and 
an index (for quickly locating the chunks). You can easily access the 
header or trailer sections reading from the start or the end of the 
frame. This way you don't need to update the indexes of chunks in 
msgpack, which can be expensive during data updates.


This indeed prevents data from being dumped using typical msgpack tools, 
but our sense is that users should care mostly about the metainfo, and 
let the libraries deal with the actual data in the most efficient way.



Thanks for your detailed reply. I spent the past few days reading the 
links/documentation and experimenting with the blosc2 meta-compressors; I 
was quite impressed by the performance of blosc2. I was also happy to see 
great alignment between the drives behind Caterva and those of NeuroJSON.


I have a few quick updates

1. I added blosc2 as a codec in my jdata module, as an alternative 
compressor to zlib/lzma/lz4 (a brief usage sketch follows after this list):


https://github.com/NeuroJSON/pyjdata/commit/ce25fa53ce73bf4cbe2cff9799b5a616e2cd75cb

2. as I mentioned, jdata/bjdata were not optimized for speed; they 
contained many inefficient ways of handling numpy arrays (as I 
discovered). After some profiling, I was able to remove most of those, 
and the run time is now nearly entirely spent in 
compression/decompression (see the attached profiler outputs for the 
zlib compressor benchmark)


3. the new jdata that supports blosc2, v0.5.0, has been tagged and 
uploaded (https://pypi.org/project/jdata)


4. I wrote a script to compare the run times of various codecs (using 
BJData and JSON as containers); the code can be found here:


https://github.com/NeuroJSON/pyjdata/blob/master/test/benchcodecs.py
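For reference, a minimal usage sketch of the new codec option (assuming 
jdata >= 0.5.0 with blosc2 installed; the codec names are the ones used in 
the benchmark below, and the file name is just an example):

import numpy as np
import jdata as jd

x = np.eye(1000)
jd.save(x, 'eye_blosc2zstd.jdb', {'compression': 'blosc2zstd'})  # blosc2+zstd codec
y = jd.load('eye_blosc2zstd.jdb')
print(np.array_equal(x, y))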

The save/load times tested on a Ryzen 9 3950X/Ubuntu 18.04 box (at 
various thread counts) are listed below (similar to what you posted 
before):



- Testing npy/npz
  'npy',    'save' 0.2914195 'load' 0.1963226 'size' 800000128
  'npz',    'save' 2.8617918 'load' 1.9550347 'size' 813846

- Testing text-based JSON files (.jdt) (nthread=8)...
  'zlib',   'save' 2.5132861 'load' 1.7221164 'size' 1084942
  'lzma',   'save' 9.5481696 'load' 0.3865211 'size' 150738
  'lz4',    'save' 0.3467197 'load' 0.5019965 'size' 4495297
  'blosc2blosclz','save' 0.0165646 'load' 0.1143934 'size' 1092747
  'blosc2lz4',  'save' 0.0175058 'load' 0.1015181 'size' 1090159
  'blosc2lz4hc','save' 0.2102167 'load' 0.1053235 'size' 4315421
  'blosc2zlib', 'save' 0.1002635 'load' 0.1188845 'size' 1270252
  'blosc2zstd', 'save' 0.0463817 'load' 0.1017909 'size' 253176

- Testing binary JSON (BJData) files (.jdb) (nthread=8)...
  'zlib',   'save' 2.4401443 'load' 1.6316463 'size' 813721
  'lzma',   'save' 9.3782029 'load' 0.3728334 'size' 113067
  'lz4',    'save' 0.3389360 'load' 0.5017435 'size' 3371487
  'blosc2blosclz','save' 0.0173912 'load' 0.1042985 'size' 819576
  'blosc2lz4',  'save' 0.0133688 'load' 0.1030941 'size' 817635
  'blosc2lz4hc','save' 0.1968047 'load' 0.0950071 'size' 3236580
  'blosc2zlib', 'save' 0.1023218 'load' 0.1083922 'size' 952705
  'blosc2zstd', 'save' 0.0468430 'load' 0.1019175 'size' 189897

- Testing binary JSON (BJData) files (.jdb) (nthread=1)...
  'blosc2blosclz','save' 0.0883078 'load' 0.2432985 'size' 819576
  'blosc2lz4',  'save' 0.0867996 'load' 0.2394990 'size' 817635
  'blosc2lz4hc','save' 2.4794559 'load' 0.2498981 'size' 3236580
  'blosc2zlib', 'save' 0.7477457 'load' 0.4873921 'size' 952705
  'blosc2zstd', 'save' 0.3435547 'load' 0.3754863 'size' 189897

- Testing binary JSON (BJData) files (.jdb) (nthread=32)...
  'blosc2blosclz','save' 0.0197186 'load' 0.1410989 'size' 819576
  'blosc2lz4',  'save' 0.0168068 'load' 0.1414074 'size' 817635
  'blosc2lz4hc','save' 0.0790011 'load' 0.0935394 'size' 3236580
  'blosc2zlib', 'save' 0.0608818 'load' 0.0985531 'size' 952705
  'blosc2zstd', 'save' 0.0370790 'load' 0.0945577 'size' 189897

a few observations:

1. single-threaded zlib/lzma are relatively slow, as reflected by the 
npz, zlib, and lzma results


2. for a simple data structure like this one, using a JSON/text-based 
wrapper vs. a binary wrapper makes only a marginal difference in speed; 
the only penalty is that text/JSON is ~33% larger than binary in size 
due to base64


3. blosc2 overall delivered very impressive speed - even in a single 
thread, it can be faster than uncompressed npz or other standard 
compression methods


4. several blosc2 compressors scaled well with more threads

5. it is a bit strang